Study 1

Human content coding

library(irr)
library(readxl)
library(knitr)
##read in data
rater <- read.csv("rater.csv")
rater1 <- c(3,7,12,16,21,25,29,33,37,41,45,49)
rater2 <- c(4,8,13,17,22,26,30,34,38,42,46,50)
rater3 <- c(9,18)

We will first calculate Krippendorff’s alpha reliability for nominal variables.

##Create vector for nominal Krippendorff alpha
nominal.alpha <- c()
for(i in rater1){
  key2 <- data.frame("rater1"=as.vector(t(rater[,i])), "rater2"=as.vector(t(rater[,i+1])))
  key2 <- t(as.matrix(key2))

  ##append alpha into vector
  nominal.alpha <- append(nominal.alpha, kripp.alpha(key2)$value)
}

##print results
category = c("Network","Aesthetics","Spontaneity","Authenticity","Lifelogging","Expression","Privacy","Simplicity","Cost","Chat","Integration","Reputation")

nominal.result = data.frame(Category = category, KrippendorffAlpha = nominal.alpha)
knitr::kable(nominal.result)
|Category     | KrippendorffAlpha|
|:------------|-----------------:|
|Network      |         0.8072562|
|Aesthetics   |         0.6558704|
|Spontaneity  |         0.7108844|
|Authenticity |         0.6320346|
|Lifelogging  |         0.8944536|
|Expression   |         1.0000000|
|Privacy      |         1.0000000|
|Simplicity   |         1.0000000|
|Cost         |         0.9546183|
|Chat         |         0.8577803|
|Integration  |         0.7897081|
|Reputation   |         0.9579416|

Now, we calculate Krippendorff’s alpha for continuous variables.

##Create vector for continuous Krippendorff alpha
cont.alpha <- c()
rater1 <- c(5,10,14,19,23,27,31,35,39,43,47,51)
for(i in rater1){
  key2 <- data.frame("rater1"=as.vector(t(rater[,i])), "rater2"=as.vector(t(rater[,i+1])))
  key2 <- t(as.matrix(key2))

  ##append alpha into vector
  cont.alpha <- append(cont.alpha, kripp.alpha(key2)$value)
}

##print results
cont.result = data.frame(Category = category, KrippendorffAlpha = cont.alpha)
knitr::kable(cont.result)
|Category     | KrippendorffAlpha|
|:------------|-----------------:|
|Network      |         0.8805540|
|Aesthetics   |         1.0000000|
|Spontaneity  |         0.8756757|
|Authenticity |         0.7572864|
|Lifelogging  |         0.7774194|
|Expression   |         0.8622754|
|Privacy      |         0.8736264|
|Simplicity   |         0.8293135|
|Cost         |         0.8989751|
|Chat         |         0.9347209|
|Integration  |         0.8419244|
|Reputation   |         1.0000000|

Finally, we calculate Krippendorff’s alpha for the nominal variables coded by three raters.

#Alpha for category 2
key2 <- data.frame("rater1"=as.vector(t(rater[,7])), "rater2"=as.vector(t(rater[,8])),"rater3"=as.vector(t(rater[,9])))
key2 <- t(as.matrix(key2))

##print results
cat("Aesthetics :",kripp.alpha(key2)$value,"\n")
## Aesthetics : 0.719488
#Alpha for category 4
key2 <- data.frame("rater1"=as.vector(t(rater[,16])), "rater2"=as.vector(t(rater[,17])),"rater3"=as.vector(t(rater[,18])))
key2 <- t(as.matrix(key2))

##print results
cat("Authenticity :",kripp.alpha(key2)$value,"\n")
## Authenticity : 0.7481627

Computer-assisted content coding

For the computer-assisted coding, we used a technique we call “phrase matching”. Two of the authors inductively identified a set of phrases corresponding to the twelve theoretically relevant categories. We then use basic text-mining procedures from the stringr package in R to identify when those phrases appear in the Path App Store feature lists. First, let’s read in the Excel data file using the readxl package.

library(stringr)
library(readxl)

wayback<-read_excel("Wayback_Features_n34.xlsx")
wayback<-as.data.frame(wayback)

Next, we’ll read in the content categories file which contains the phrases we want to code for:

cats<-read_excel("Raw_Coding_Data.xlsx", sheet="Category Guide")
cats<-as.data.frame(cats)

When Excel files are read into R, the character strings we want to code may come in as a different class, so we convert them to character strings:

##convert all fourteen feature columns to character
for(i in 1:14){
  col<-paste("Feature.", i, sep="")
  wayback[[col]]<-as.character(wayback[[col]])
}

There is a variable in the Wayback data called URL, which contains the hyperlink used to collect the Path App Store descriptions. This hyperlink contains the date that the snapshot was taken in the Wayback archive – we extract that using the str_sub function from the stringr package. This will come in handy when we go to plot the results.

wayback$Date<-NA

datefn<-function(i){
  year<-str_sub(wayback$URL[i], 29, 32)
  month<-str_sub(wayback$URL[i], 33, 34)
  day<-str_sub(wayback$URL[i], 35, 36)
  date<-paste(year, "-", month, "-", day, sep="")
  return(date)
}

wayback$Date<-as.Date(unlist(lapply(1:dim(wayback)[1], datefn)))

Next, we convert these dates to a numeric variable called Date_num. In short, this variable represents the number of days elapsed since the start of our data collection period, with the first observation (November 15, 2010) indexed as 0. We can then quickly calculate the differences, i.e. the number of days between consecutive observations. The mean and standard deviation of these differences (reported in the paper) are shown below.

wayback$Date_num<-as.numeric(wayback$Date-wayback$Date[1])

differences<-diff(wayback$Date_num)
print(paste("Mean:", mean(differences)))
print(paste("SD:", sd(differences)))
## [1] "Mean: 76.8484848484848"
## [1] "SD: 41.5783306033032"

Next, we’ll need a count of the number of features contained in each observation – this will allow us to create proportional measures of our content codes later on.

wayback$nFeatures<-0
for(i in 1:14){
  ##later slots overwrite earlier ones, so nFeatures ends at the highest non-empty Feature column
  wayback$nFeatures[wayback[[paste("Feature.", i, sep="")]]!=""]<-i
}

Then, we’ll need a variable which counts the number of features in the observation which contain a phrase from each of the twelve theoretical categories. In this code and all code to follow, the numbers correspond to the following categories:

  1. Network
  2. Aesthetics
  3. Spontaneity
  4. Authenticity
  5. Lifelogging
  6. Expression
  7. Privacy
  8. Simplicity
  9. Cost
  10. Chat
  11. Integration
  12. Reputation
##initialize the twelve per-category counters
for(i in 1:12){
  wayback[[paste("nFeat_", i, sep="")]]<-0
}

Now, we just need to code the feature lists. To accomplish this, we’ll need three functions: catcleanfn, featmatchfn, and featcountfn. The first function cleans the character string holding the inductively identified phrases. The format of these strings looks something like: "phrase 1", "phrase 2", "longer example phrase 1". We first split on the double-quote character ("), as that will help us split apart each of the phrases. We then remove items from the resulting vectors which correspond to the separating characters in the string, such as “,”. The function then returns the cleaned list for use by another function.

catcleanfn<-function(i){
  templist<-unlist(str_split(cats[i,3], "\""))
  templist<-templist[templist!=""]
  templist<-templist[templist!=", "]
  templist<-templist[templist!=". "]
  templist<-templist[templist!=" "]
  templist<-templist[templist!=","]
  return(templist)
}
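To make the cleaning concrete, here is the same splitting logic applied to a literal string in the format described above. The phrases are invented for illustration and are not taken from the actual category guide; base R’s strsplit behaves the same as str_split here for a fixed pattern.

```r
## A string formatted like the category guide entries: "phrase 1", "phrase 2", ...
raw <- "\"send a message\", \"instant chat\", \"talk to friends\""

## Split on the double-quote character, then drop the separator fragments
templist <- unlist(strsplit(raw, "\"", fixed=TRUE))
templist <- templist[!templist %in% c("", ", ", ". ", " ", ",")]

templist
## the three phrases, with the quotes and comma separators stripped
```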

In the next function, we need to take the list of features for the observation, assess whether each feature contains one or more of the phrases for the topic category, and return the count of the features which match our criterion. This starts with a vector of 0’s with the same length as the number of features in the observation. We then use grepl to determine whether one of the listed phrases is contained in each feature. If it is, we change the 0 to a 1. Then we return the sum of that vector.

featmatchfn<-function(i, nfeat, catlist){
  templist<-rep(0, nfeat)
  for(j in 1:nfeat){
    feature<-wayback[[paste("Feature.", j, sep="")]][i]
    for(k in 1:length(catlist)){
      if(grepl(catlist[k], feature, ignore.case=T)){
        templist[j]<-1
      }
    }
  }
  return(sum(templist))
}
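The matching criterion itself is just grepl with ignore.case=TRUE, so a feature counts as a hit if any category phrase appears anywhere in its text. A quick check with invented strings (note that the phrases are treated as regular expressions, so any regex metacharacters in a phrase would need escaping):

```r
catlist <- c("send a message", "instant chat")       # hypothetical phrases
feature <- "Instant chat with your closest friends"  # hypothetical feature text

## TRUE if any phrase matches, ignoring case
any(sapply(catlist, grepl, x=feature, ignore.case=TRUE))
## [1] TRUE
```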

The last function puts both of the above functions together so that we can quickly iterate over the full data set. The cat variable takes a value between 1 and 12, which corresponds to the above-listed topic categories.

featcountfn<-function(i, cat){
  catlist<-catcleanfn(cat)
  if(wayback$nFeatures[i]==0){
    return(0)
  }
  if(wayback$nFeatures[i]>0){
    return(featmatchfn(i, wayback$nFeatures[i], catlist))
  }
}

Now we can run those functions for each of the 12 categories, saving the results in the master data frame.

for(cat in 1:12){
  wayback[[paste("nFeat_", cat, sep="")]]<-unlist(lapply(1:dim(wayback)[1], featcountfn, cat=cat))
}

While the above function returns the count of the features which pertain to the category of interest, the total number of features varies from observation to observation. To account for this fluctuation, we can also measure the proportion of features in the description which mention each topic. To do so, we simply divide the above results by the total number of features in the observation, nFeatures.

for(i in 1:12){
  wayback[[paste("nFeat_", i, "_prop", sep="")]]<-wayback[[paste("nFeat_", i, sep="")]]/wayback$nFeatures
}

At the topic level, here are the results of the coding procedures. N is the total number of features categorized under each topic. For each topic, the per-observation mean and standard deviation of the coded features are presented in both raw counts and proportions.

|Topic        |   N| Count Mean| Count SD| Proportion Mean| Proportion SD|
|:------------|---:|----------:|--------:|---------------:|-------------:|
|Network      |  39|       1.15|     0.82|            0.17|          0.11|
|Aesthetics   |  69|       2.03|     1.24|            0.29|          0.12|
|Spontaneity  | 107|       3.15|     1.35|            0.47|          0.16|
|Authenticity | 180|       5.29|     2.39|            0.76|          0.15|
|Lifelogging  |  47|       1.38|     0.95|            0.19|          0.09|
|Expression   |  46|       1.35|      1.1|            0.18|           0.1|
|Privacy      |  22|       0.65|     1.15|            0.07|           0.1|
|Simplicity   |  17|        0.5|     0.62|            0.08|          0.09|
|Cost         |  11|       0.32|     0.84|            0.03|          0.07|
|Chat         |   8|       0.24|     0.43|            0.02|          0.04|
|Integration  |  83|       2.44|     1.54|            0.34|          0.12|
|Reputation   |   1|       0.03|     0.17|               0|          0.01|
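The code that produced this summary table is not shown above; under the naming scheme created earlier (the nFeat_k and nFeat_k_prop columns in wayback), a sketch like the following should reproduce it. The helper name summfn is ours, not from the original analysis.

```r
## Per-category summary: total features matched, plus per-observation
## mean/SD of the raw counts and of the proportions
summfn <- function(df, cat){
  counts <- df[[paste("nFeat_", cat, sep="")]]
  props  <- df[[paste("nFeat_", cat, "_prop", sep="")]]
  c(N = sum(counts),
    Count.Mean = round(mean(counts), 2),
    Count.SD   = round(sd(counts), 2),
    Prop.Mean  = round(mean(props), 2),
    Prop.SD    = round(sd(props), 2))
}

## Applied to the full data set:
## topic.table <- t(sapply(1:12, function(k) summfn(wayback, k)))
## rownames(topic.table) <- category
## knitr::kable(topic.table)
```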

Finally, we plotted the proportions over time for each topic in the final paper. These graphs show the trends – loess regression with proportion as the dependent variable, and numeric date as the independent variable – of topic usage over time. For readability, we elected to split the graph in two; the topics shown in each were chosen to minimize line overlap. This first graph shows the Network, Authenticity, Lifelogging, Expression, Chat, and Reputation topics.

The second graph depicts the trends of the Spontaneity, Aesthetics, Privacy, Simplicity, Cost, and Integration topics.

If you are interested in producing the same figures as presented in the paper, you can use the code below. Similar figures for the raw counts could be created by removing _prop from the variable names and adjusting the y-axis scale.

##x-axis tick positions: the day index of each January 1 (day 0 = Nov 15, 2010),
##shifted by one after each leap year (2012 and 2016)
years<-seq(47, 2536, by=365)
years[3:7]<-years[3:7]+1
years[7]<-years[7]+1
yearNs<-c("2011", "2012", "2013", "2014", "2015", "2016", "2017")

bettercolors <- c(
  rgb(166, 206, 227, maxColorValue=255), #light blue
  rgb(31, 120, 180, maxColorValue=255), #dark blue
  rgb(178, 223, 138, maxColorValue=255), #light green
  rgb(51, 160, 44, maxColorValue=255), #dark green
  rgb(251, 154, 153, maxColorValue=255), #light red
  rgb(227, 26, 28, maxColorValue=255)) #dark red

linetypes <- c(1, 1, 2, 2, 6, 6)

pdf("WordMatch_Prop_A.pdf", width=13, height=13)
par(mar=c(5, 5, 0, 0))
plot(1, 1, type='n', xlim=c(0, 2536), ylim=c(0, 1.05), main='', xlab='', ylab='', axes=F)
axis(1, at=c(-9999, years, 9999), labels=c(-9999, yearNs, 9999), font.axis=2, cex.axis=1.125)
text(mean(c(0, 2536)), -0.1, "Time", font=4, cex=1.8, xpd=T)
axis(2, at=seq(0, 1, by=.1), labels=c(0, "", 0.2, "", 0.4, "", 0.6, "", 0.8, "", "1.0"), font.axis=2, cex.axis=1.125)
text(-263.1797, mean(c(0, 1.02)), "Proportion of Features", font=4, cex=1.8, xpd=T, srt=90)
lines(loess.smooth(wayback$Date_num, wayback$nFeat_1_prop, span = 0.4, family = "gaussian"), col=bettercolors[1], lwd=6, lty=linetypes[1])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_4_prop, span = 0.4, family = "gaussian"), col=bettercolors[2], lwd=6, lty=linetypes[2])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_5_prop, span = 0.4, family = "gaussian"), col=bettercolors[3], lwd=6, lty=linetypes[3])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_6_prop, span = 0.4, family = "gaussian"), col=bettercolors[4], lwd=6, lty=linetypes[4])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_10_prop, span = 0.4, family = "gaussian"), col=bettercolors[5], lwd=6, lty=linetypes[5])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_12_prop, span = 0.4, family = "gaussian"), col=bettercolors[6], lwd=6, lty=linetypes[6])
rect(373.7152, 0.92, 2162.285, 1.085)
text(mean(c(0, 2536)), 1.05, "Topic", font=4, cex=1.65)
segments(seq(-69.47945, 2605.479, length.out=6)[2]+25, 1.00, seq(-69.47945, 2605.479, length.out=6)[2]+225, 1.00, col=bettercolors[1], lwd=6, lty=linetypes[1])
segments(seq(-69.47945, 2605.479, length.out=6)[2]+25, 0.95, seq(-69.47945, 2605.479, length.out=6)[2]+225, 0.95, col=bettercolors[2], lwd=6, lty=linetypes[2])
segments(seq(-69.47945, 2605.479, length.out=6)[3]+25, 1.00, seq(-69.47945, 2605.479, length.out=6)[3]+225, 1.00, col=bettercolors[3], lwd=6, lty=linetypes[3])
segments(seq(-69.47945, 2605.479, length.out=6)[3]+25, 0.95, seq(-69.47945, 2605.479, length.out=6)[3]+225, 0.95, col=bettercolors[4], lwd=6, lty=linetypes[4])
segments(seq(-69.47945, 2605.479, length.out=6)[4]+25, 1.00, seq(-69.47945, 2605.479, length.out=6)[4]+225, 1.00, col=bettercolors[5], lwd=6, lty=linetypes[5])
segments(seq(-69.47945, 2605.479, length.out=6)[4]+25, 0.95, seq(-69.47945, 2605.479, length.out=6)[4]+225, 0.95, col=bettercolors[6], lwd=6, lty=linetypes[6])
text(seq(-69.47945, 2605.479, length.out=6)[2]+225, 1.00, cats[1, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[2]+225, 0.95, cats[4, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[3]+225, 1.00, cats[5, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[3]+225, 0.95, cats[6, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[4]+225, 1.00, cats[10, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[4]+225, 0.95, cats[12, 2], font=1, cex=1.25, pos=4)
dev.off()

pdf("WordMatch_Prop_B.pdf", width=13, height=13)
par(mar=c(5, 5, 0, 0))
plot(1, 1, type='n', xlim=c(0, 2536), ylim=c(0, 1.05), main='', xlab='', ylab='', axes=F)
axis(1, at=c(-9999, years, 9999), labels=c(-9999, yearNs, 9999), font.axis=2, cex.axis=1.125)
text(mean(c(0, 2536)), -0.1, "Time", font=4, cex=1.8, xpd=T)
axis(2, at=seq(0, 1, by=.1), labels=c(0, "", 0.2, "", 0.4, "", 0.6, "", 0.8, "", "1.0"), font.axis=2, cex.axis=1.125, las=1)
text(-263.1797, mean(c(0, 1.02)), "Proportion of Features", font=4, cex=1.8, xpd=T, srt=90)
lines(loess.smooth(wayback$Date_num, wayback$nFeat_3_prop, span = 0.4, family = "gaussian"), col=bettercolors[1], lwd=6, lty=linetypes[1])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_2_prop, span = 0.4, family = "gaussian"), col=bettercolors[2], lwd=6, lty=linetypes[2])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_7_prop, span = 0.4, family = "gaussian"), col=bettercolors[3], lwd=6, lty=linetypes[3])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_8_prop, span = 0.4, family = "gaussian"), col=bettercolors[4], lwd=6, lty=linetypes[4])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_9_prop, span = 0.4, family = "gaussian"), col=bettercolors[5], lwd=6, lty=linetypes[5])
lines(loess.smooth(wayback$Date_num, wayback$nFeat_11_prop, span = 0.4, family = "gaussian"), col=bettercolors[6], lwd=6, lty=linetypes[6])
rect(373.7152, 0.92, 2162.285, 1.085)
text(mean(c(0, 2536)), 1.05, "Topic", font=4, cex=1.65)
segments(seq(-69.47945, 2605.479, length.out=6)[2]+25, 1.00, seq(-69.47945, 2605.479, length.out=6)[2]+225, 1.00, col=bettercolors[1], lwd=6, lty=linetypes[1])
segments(seq(-69.47945, 2605.479, length.out=6)[2]+25, 0.95, seq(-69.47945, 2605.479, length.out=6)[2]+225, 0.95, col=bettercolors[2], lwd=6, lty=linetypes[2])
segments(seq(-69.47945, 2605.479, length.out=6)[3]+25, 1.00, seq(-69.47945, 2605.479, length.out=6)[3]+225, 1.00, col=bettercolors[3], lwd=6, lty=linetypes[3])
segments(seq(-69.47945, 2605.479, length.out=6)[3]+25, 0.95, seq(-69.47945, 2605.479, length.out=6)[3]+225, 0.95, col=bettercolors[4], lwd=6, lty=linetypes[4])
segments(seq(-69.47945, 2605.479, length.out=6)[4]+25, 1.00, seq(-69.47945, 2605.479, length.out=6)[4]+225, 1.00, col=bettercolors[5], lwd=6, lty=linetypes[5])
segments(seq(-69.47945, 2605.479, length.out=6)[4]+25, 0.95, seq(-69.47945, 2605.479, length.out=6)[4]+225, 0.95, col=bettercolors[6], lwd=6, lty=linetypes[6])
text(seq(-69.47945, 2605.479, length.out=6)[2]+225, 1.00, cats[3, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[2]+225, 0.95, cats[2, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[3]+225, 1.00, cats[7, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[3]+225, 0.95, cats[8, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[4]+225, 1.00, cats[9, 2], font=1, cex=1.25, pos=4)
text(seq(-69.47945, 2605.479, length.out=6)[4]+225, 0.95, cats[11, 2], font=1, cex=1.25, pos=4)
dev.off()

Study 2

Structural topic model

In order to ascertain the topics contained in the Twitter data we collected (see the paper for a detailed description of the data), we fitted a Structural Topic Model (STM) to the corpus. STMs are a form of unsupervised machine learning in which documents are grouped into topics, each summarized by its so-called FREX words. FREX words are those which occur frequently within a topic but which are also relatively exclusive to it. It is then incumbent upon the researcher to interpret the FREX words and high-fitting examples to ascertain what a group of documents means.

This type of model differs from other forms of topic modeling, such as Latent Dirichlet allocation, in that it allows the researcher to specify document covariates which may help the model fit topics. We’ll show how this works in the code below. For now, let’s start by loading the libraries we’ll need for this section and reading in the full Twitter corpus.

library(stm)
library(tm)
library(stringr)
library(openxlsx)
library(SnowballC)
library(wordcloud)
library(Rtsne)
library(rsvd)
library(geometry)

RawTwitter<-readRDS("Path_Data_Aggregation.rds")

For the sake of the privacy of the Twitter users in our corpus, we will not be publishing the full data set online. However, know that we use three variables in the following code: Date_EST, a UTC timestamp indicating when the tweet was published; Contents, the text of the tweet; and Author, the username of the author of the tweet. A mock data frame is provided below:

|Date_EST            |Contents                                                                                                     |Author     |
|:-------------------|:------------------------------------------------------------------------------------------------------------|:----------|
|2010-11-15 16:26:43 |@Path is a new social networking app. Can it compete with Facebook?                                            |@UserName1 |
|2010-11-15 16:27:18 |I used @Path stickers to decorate my laptop - What do you think?!                                              |@UserName2 |
|2010-11-15 16:29:57 |RT @username Path is a new app focused on close social networks. Check it out today on the Apple App Store!    |@UserName3 |
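Since the raw corpus is private, readers who want to step through the code below can build a small mock data frame with the same three variables. The rows reproduce the examples from the table above:

```r
## Mock stand-in for the private Twitter corpus (same column names as the real data)
RawTwitter <- data.frame(
  Date_EST = c("2010-11-15 16:26:43", "2010-11-15 16:27:18", "2010-11-15 16:29:57"),
  Contents = c("@Path is a new social networking app. Can it compete with Facebook?",
               "I used @Path stickers to decorate my laptop - What do you think?!",
               "RT @username Path is a new app focused on close social networks. Check it out today on the Apple App Store!"),
  Author   = c("@UserName1", "@UserName2", "@UserName3"),
  stringsAsFactors = FALSE)
```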

Before we fit the STM, we need to clean up the text. First, Crimson Hexagon seems to have included some tweets which are not classified as retweets but which nevertheless follow the convention for a retweet (see the 3rd example above). We specifically want to remove tweets which contain the capital letters “RT” followed by a space. We can use the grep function in base R to index those observations:

RawTwitter$RT<-0
RawTwitter$RT[grep("RT ", RawTwitter$Contents, ignore.case=F)]<-1

Next, there are a number of tweets which contain links to other websites. Since the content of those websites is out of the scope of this study, we elected to remove them as well. We can find links through a similar process:

RawTwitter$HTTP<-0
RawTwitter$HTTP[grep("http", RawTwitter$Contents, ignore.case=T)]<-1

Let’s take a look at how many cases either have a link or are retweets:

table(RawTwitter$RT, RawTwitter$HTTP)
##              RawTwitter$HTTP
## RawTwitter$RT     0     1
##             0 43349 17093
##             1  6954  6942

And now let’s remove those cases from the data:

data<-RawTwitter[(RawTwitter$RT==0 & RawTwitter$HTTP==0),]

Next, there are some words and phrases in these tweets which we found result in the few remaining non-English tweets being over-fitted to a topic. This is problematic because it hinders our ability to interpret the meaning of a group of documents. We found that removing usernames (“@.*”) and anything containing the word “path” (“.*path.*”) at least ensures that the non-English tweets are entirely in the author’s native language. Moreover, usernames contribute little to the substantive meaning of English tweets as well.

Let’s start with removing usernames – first, we’ll make a new variable called Content_wo_Users to store the cleaned contents. Then we’ll clean up excess spaces using str_trim and str_squish from the stringr package. After that, we can split the string on spaces to separate it into a vector of words. We then use the gsub command to replace "@.*" with "". The ".*" in the pattern means that R will match any word that begins with @ followed by any set of characters, such as “@abc123”.

data$Content_wo_Users<-NA

for(i in 1:dim(data)[1]){
  temptext<-data$Contents[i]
  temptext<-str_trim(temptext)
  temptext<-str_squish(temptext)
  templist<-unlist(str_split(temptext, " "))
  templist<-gsub("@.*", "", templist)
  temptext<-templist[1]
  if(length(templist)>1){
    for(j in 2:length(templist)){
      if(templist[j]!=""){
        temptext<-paste(temptext, templist[j], sep=" ")
      }
    }
  }
  data$Content_wo_Users[i]<-temptext
}

Next, we’ll remove instances of “path” using a similar process. This time, we set the ignore.case parameter to TRUE to ensure that both “path” and “Path” are removed.

data$Content_wo_Path<-NA

for(i in 1:dim(data)[1]){
  temptext<-data$Content_wo_Users[i]
  temptext<-stringr::str_trim(temptext)
  temptext<-stringr::str_squish(temptext)
  templist<-unlist(stringr::str_split(temptext, " "))
  templist<-gsub(".*path.*", "", templist, ignore.case=T)
  temptext<-templist[1]
  if(length(templist)>1){
    for(j in 2:length(templist)){
      if(templist[j]!=""){
        temptext<-paste(temptext, templist[j], sep=" ")
      }
    }
  }
  data$Content_wo_Path[i]<-temptext
}

Great! Now that the text is cleaned up, let’s take a look at a word cloud. Word clouds are useful for visualizing term frequency in a given corpus. First, let’s convert the character vector into a corpus using the tm package – we’ll also remove punctuation and stopwords (e.g., “the” or “of”).

textCorp<-Corpus(VectorSource(data$Content_wo_Path))
textCorp<-tm_map(textCorp, removePunctuation)
textCorp<-tm_map(textCorp, removeWords, stopwords('en'))

Now let’s make the word cloud using the corpus we just created. This function comes from the wordcloud package. Note: ellipses or blank boxes replace unicode characters, such as emojis or non-English characters, which the wordcloud package is incapable of plotting.

If you would like to recreate the figure yourself, use the following code:

pdf(file="wordcloud.pdf", width=6.5, height=6.5)
par(mar=c(0,0,0,0))
wordcloud(textCorp, max.words=500, random.order=FALSE)
dev.off()

Turning now to the STM, we first need to isolate the document metadata we would like to include in the model as a covariate. First, let’s focus on the date. To include it in the model, we converted the date to a numeric variable, starting at 0 on the earliest date in the data: we stripped the time from Date_EST, converted the date to numeric, and subtracted the minimum date.

data$date<-as.Date(data$Date_EST)
data$datenum<-as.numeric(data$date-min(data$date))

Next, let’s put this together with Author into a data frame called meta, and we’ll double check that the vectors are the correct class for analysis.

meta<-as.data.frame(cbind(data$Author, data$datenum))
names(meta)<-c("author", "date")
meta$author<-as.factor(unlist(meta$author))
meta$date<-as.numeric(as.character(meta$date))

Then we need to use the textProcessor function in the stm package. This does a number of string transformations all at once:

  • Associates documents (tweets) with our metadata
  • Converts all characters to lower case
  • Removes stop words (SMART list; Salton, 1971)
  • Removes numbers
  • Removes punctuation

Normally, this process would also stem words (e.g., converting “communication” and “communicating” to “communicat”) so that the document-term matrix can be collapsed. However, stemming has largely fallen out of vogue in modern natural language processing research, as it can sometimes create similarities between documents where none should exist.
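To illustrate what we are skipping, the SnowballC package loaded earlier can stem words directly; both inflected forms below should collapse onto the same truncated stem:

```r
library(SnowballC)

## Porter stemming maps inflected forms onto one (often non-word) stem
wordStem(c("communication", "communicating"), language = "porter")
```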

textPro<-textProcessor(data$Content_wo_Path, metadata=meta, stem=F, verbose=T)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...

This processing often removes whole documents from the corpus – in our case, 486 of the 43349 observations – because they contained only elements which were removed in processing. These results will be important later on, so we save them out.

saveRDS(textPro, file="textPro.rds")

The next processing step is to run prepDocuments, again from the stm package. This step prepares the documents, vocab, and meta components (output from textProcessor) used by the STM. We set lower.thresh to 0 to prevent this step from removing words which appear in only one document – this keeps the document-term matrix from being too sparse to fit the model. If running on a corpus with longer documents than tweets, this may not be necessary.

textOut<-prepDocuments(textPro$documents, textPro$vocab, textPro$meta, lower.thresh=0, verbose=T)

Now that the documents are prepared, we can conduct a procedure called searchK. This estimates a series of STMs across a wide range of k values – the number of topics modeled. A researcher can use this technique to find a k suitable for the final model. This step can be very computationally intensive, so it is best suited to Mac and Linux machines, which can use the cores parameter to parallelize across multiple CPU cores. Depending on the corpus, a large amount of RAM may be required as well – our corpus of ~43,000 tweets runs fine with 8 GB or more on a single processing thread. When compiling this document on a machine running R 4.0.0 on Mac OS X 10.15.4 with an Intel Core i7 7820X (16 threads, 3.6 GHz) and 64 GB of memory, the searchK completed in approximately 3 hours. We also highly recommend setting a seed for the heldout documents so that the process is repeatable if you lose your progress partway through.

kresult_2_100 <- searchK(documents=textOut$documents, vocab=textOut$vocab, K=(2:100), data=textOut$meta, heldout.seed=756149, cores=parallel::detectCores(), verbose=T)
## Using multiple-cores.  Progress will not be shown.

The verbose output will be blocked when running in parallel, but if run using a single core, this parameter can be used to track progress.

To evaluate the results of the searchK, one can simply plot the object:

stm::plot.searchK(kresult_2_100)

When reading these plots to determine a k value for the final model, each can be read in a manner similar to a scree plot in factor analysis. That is, one should look for the “elbow” in the line – the x-axis value at which that bend occurs should be the k used for the full STM. For an example of a diagnostic plot with a clear k value, see the supplemental materials for Sweitzer & Shulman (2018).

Unfortunately, our searchK results are unclear at best. Luckily, there is another option. The stm package implements a non-deterministic spectral initialization developed by Lee and Mimno (2014) which estimates a k value appropriate for the data. To use it in the final model estimation, we set the parameter K to 0 and init.type to ‘Spectral’. We’ll set the seed here to the same one used for the paper so that the same set of documents is held out in the estimation. The verbose parameter can be used to show progress (hidden in this file). Finally, we recommend setting a higher max.em.its and maxV than the defaults to give the expectation maximization enough time to fit the model well.

mod <- stm(documents=textOut$documents, vocab=textOut$vocab, K=0, data=textOut$meta, init.type='Spectral', seed=169577, verbose=F, max.em.its=10000, control=list(maxV=10000))

Before we do anything with the results, let’s save the model so we can come back to this later without running it again:

saveRDS(mod, "stm.rds")

Let’s first take a look at the FREX words – these, again, are the most frequently occurring and relatively exclusive words within a group of documents. You can adjust the n parameter to print more or fewer words per topic, but we find that the default of 7 suits our data well.

labelTopics(mod, n=7)$frex

Here is what our FREX table looks like:

Topic FREX 1 FREX 2 FREX 3 FREX 4 FREX 5 FREX 6 FREX 7
1 seeing bit forward exciting moving redesign killing
2 stickers pic public sticker bought spend grow
3 twitter post notifications posts posting connect push
4 understand alone designer fav devs pick engineer
5 took line due process tracking ship discovery
6 something probably missing wrong similar cover image
7 updates two weeks hate business status couple
8 ios apps issue updated platform running crashes
9 since makes user data api believe end
10 thanks come haha checking signed making kind
11 say life real must checked broken lately
12 congrats big via excited release launch fan
13 friends many family join close small limit
14 android users fail beta filter updating example
15 try ill look interesting maybe give guess
16 thats others goes called guy incredible comes
17 account please email help fix dear response
18 sleep hours screen button asking especially noticed
19 sharing video everything filters super concept happen
20 even phone anybody damn contacts text windows
21 really seems lot feel yup pupil died
22 feature music search send found saw bug
23 see everyone going everybody downloading okay blackberry
24 service dave apparently welcome ceo fine morin
25 think uses nobody yep game model stick
26 another personal favorite happened networking dead honestly
27 can tell omg export vine ugly imagine
28 wish used web comments worked knew potential
29 version wonder worth latest buy fucking behind
30 may mean free ask premium pay stream
31 still trying yet though havent around issues
32 right let wont person profile possible fact
33 new whats officially sexy diggin avatar tryin
34 app followers luh hip castle chillen confusedmakers
35 now cool pretty needs hell slick digging
36 much yes thank agree totally weird completely
37 someone arent explain sending showing recently locations
38 get point deleting sucks enough know stupid
39 share photos able moment pics camera etc
40 last week year party night sxsw went
41 youre loving timeline stop reason years enjoy
42 privacy tech online quality room becoming press
43 well hope experience serious otherwise helps ’ll
44 network private part site almost tool sites
45 looks check sure actually far sweet wow
46 facebook feels simple clean leave simply keeping
47 better instagram fuck whole gone move told
48 one every things definitely designed word beautifully
49 using kinda application acting addme shiny transition
50 guys takes main visit fyi suggestion rock
51 time deleted ago tho opened obsessed battery
52 use idea dont figure often hard different
53 confusing bout idk perfect yea addicted brilliant
54 working moments isnt activity except surprised missed
55 find doesnt wait seem says log feed
56 friend yeah show list added either requests
57 photo didnt long tweet question full decided
58 just getting downloaded thinking got already sorry
59 lol delete yall thought true liking put
60 ive thing never ever seen always anything
61 good work bad stuff morning hear quite
62 like seriously starting love might tried sounds
63 will take next nike future fuelband run
64 add download remember shit follow exactly whos
65 update today features store messaging available slow
66 nice view left meet finding itd stuck
67 coming number stay saying smart hopefully mention
68 first place thoughts world came socialmedia checkin
69 mobile problem intimate case rather break extra
70 design beautiful team amazing job product impressed
71 gonna chance back gotta sad finally bored
72 also location without sign theres change address
73 make hey made option name shows movies
74 anyone crashing liked elses hardly socialnetwork flash
75 way works joined little wanted set useable

Typically, if you wanted to look at some of the top-fitted examples for each topic to help you label the output, you could simply run the findThoughts function available in the stm package. This function takes three inputs: the estimated model (mod in our code), the original texts used to construct the corpus (data$Content_wo_Path in our code), and n, or the number of examples you would like to pull. However, because this uses the top theta value (or the fit score) from the results, this can effectively result in all of the non-English tweets “floating to the top” of the results. This is because those tweets contain highly exclusive words, encoded in their original language, which means they will fit highly to one topic in the stm results. Instead, we developed a new means of pulling examples, described in greater detail in the paper.
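For reference, the standard call would look something like the following. We do not use it here, for the reasons above; this sketch assumes the fitted model mod and the original texts in data$Content_wo_Path from earlier in this walkthrough:

```r
#Standard stm approach: pull the n documents with the highest theta
#for a given topic. Not used here because non-English tweets dominate.
examples <- findThoughts(mod, texts=data$Content_wo_Path, topics=1, n=3)
examples$docs[[1]] #the three top-fitted texts for topic 1
```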

For our purposes, we’ll need to construct a data frame that can be exported to Excel for manual review and interpretation. We begin by constraining our original data frame to contain only the observations which were retained after processing.

data<-data[-(textPro$docs.removed),]

Next, we’ll need to count how many tweets were maximally fitted to each of the 75 topics. We’ll start by adding two variables to the original data frame: Topic, which will inherit the integer (1:75) corresponding to that observation’s top-fitted topic, and Topic_fit, which will take on the theta value, or fit score between 0 and 1, for that topic. Theta can be understood as the proportion of the content in the observation which corresponds to the stm topic. In our for loops below, we pull the maximum value from the theta matrix for the corresponding row, then we pull the topic (1:75) whose theta equals that maximum.

data$Topic<-NA
data$Topic_fit<-NA
for(i in 1:dim(data)[1]){
  #highest theta (best fit) for document i
  data$Topic_fit[i]<-max(mod$theta[i,])
  #topic number (1:75) whose theta equals that maximum
  for(j in 1:dim(mod$theta)[2]){
    if(mod$theta[i,j]==data$Topic_fit[i]){
      data$Topic[i]<-j
    }
  }
}

Now we can start constructing our example frame. We’ll start with the topic numbers (1:75) and how many tweets were fitted best to those topics. Because we used the Lee and Mimno (2014) initialization procedure, our last topic (75) will have zero tweets which fit best to it, so we add “0” to the list n in the last position. Then we’ll add in the FREX words we looked at before.

topic<-1:75
n<-as.numeric(table(data$Topic))
n<-c(n, 0)
frex<-as.data.frame(cbind(topic, n))

frex$frex_1<-labelTopics(mod, n=7)$frex[,1]
frex$frex_2<-labelTopics(mod, n=7)$frex[,2]
frex$frex_3<-labelTopics(mod, n=7)$frex[,3]
frex$frex_4<-labelTopics(mod, n=7)$frex[,4]
frex$frex_5<-labelTopics(mod, n=7)$frex[,5]
frex$frex_6<-labelTopics(mod, n=7)$frex[,6]
frex$frex_7<-labelTopics(mod, n=7)$frex[,7]

Let’s look at the head:

head(frex)
##   topic    n     frex_1   frex_2        frex_3   frex_4   frex_5   frex_6
## 1     1  258     seeing      bit       forward exciting   moving redesign
## 2     2  349   stickers      pic        public  sticker   bought    spend
## 3     3 1275    twitter     post notifications    posts  posting  connect
## 4     4  219 understand    alone      designer      fav     devs     pick
## 5     5  280       took     line           due  process tracking     ship
## 6     6  481  something probably       missing    wrong  similar    cover
##      frex_7
## 1   killing
## 2      grow
## 3      push
## 4  engineer
## 5 discovery
## 6     image

Next, we’ll create the columns that we’ll write the examples to:

frex$ex_1<-NA
frex$ex_2<-NA
frex$ex_3<-NA
frex$ex_4<-NA
frex$ex_5<-NA
frex$ex_6<-NA
frex$ex_7<-NA
frex$ex_8<-NA
frex$ex_9<-NA
frex$ex_10<-NA

Now comes the tricky part. To pull our examples, we want to be able to order our observations based on how many of the seven FREX words are contained in the tweet, for each of the 75 topics. This means we will need 42,863 x 7, or 300,041 dichotomous measures, plus one per observation to average the 7 counts. Let’s start by creating vectors filled with 42,863 zeros – eight vectors for each topic: 1 per FREX word and 1 mean.

Then, we need to change 0’s to 1’s when the FREX word is contained in the tweet. We’ll use two nested for loops for this: one to iterate through the 75 topics, and one to iterate through the 7 FREX words and the mean. Inside the second for loop, when j is equal to 1:7, we’ll take the FREX word for the current topic, retrieve the corresponding list of 0’s, and use a logical grep function, grepl, to replace the 0 with 1 for the observations which contain the word, regardless of case. Then, when j is 8, we’ll fill the mean with the average of the 1:7 lists. This gives us one vector with the proportion of the 7 FREX words contained in each tweet for each topic.

for(i in 1:75){
  assign(paste("Top", i, "_Frex1", sep=""), rep(0, dim(data)[1]))
  assign(paste("Top", i, "_Frex2", sep=""), rep(0, dim(data)[1]))
  assign(paste("Top", i, "_Frex3", sep=""), rep(0, dim(data)[1]))
  assign(paste("Top", i, "_Frex4", sep=""), rep(0, dim(data)[1]))
  assign(paste("Top", i, "_Frex5", sep=""), rep(0, dim(data)[1]))
  assign(paste("Top", i, "_Frex6", sep=""), rep(0, dim(data)[1]))
  assign(paste("Top", i, "_Frex7", sep=""), rep(0, dim(data)[1]))
  assign(paste("Top", i, "_FrexM", sep=""), rep(0, dim(data)[1]))
}

for(i in 1:75){
  for(j in 1:8){
    if(j<8){
      tempword<-frex[i, j+2]
      templist<-get(paste("Top", i, "_Frex", j, sep=""))
      templist[grepl(paste("\\<", tempword, "\\>", sep=""), data$Content_wo_Path, ignore.case=T)]<-1
      assign(paste("Top", i, "_Frex", j, sep=""), templist)
      rm("tempword", "templist")
    }
    if(j==8){
      templistM<-get(paste("Top", i, "_FrexM", sep=""))
      templist1<-get(paste("Top", i, "_Frex1", sep=""))
      templist2<-get(paste("Top", i, "_Frex2", sep=""))
      templist3<-get(paste("Top", i, "_Frex3", sep=""))
      templist4<-get(paste("Top", i, "_Frex4", sep=""))
      templist5<-get(paste("Top", i, "_Frex5", sep=""))
      templist6<-get(paste("Top", i, "_Frex6", sep=""))
      templist7<-get(paste("Top", i, "_Frex7", sep=""))
      for(k in 1:length(templistM)){
        templistM[k]<-mean(c(templist1[k], templist2[k], templist3[k], templist4[k], templist5[k], templist6[k], templist7[k]))
      }
      assign(paste("Top", i, "_FrexM", sep=""), templistM)
      rm("templistM", "templist1", "templist2", "templist3", "templist4", "templist5", "templist6", "templist7")
    }
  }
}

Now that we have the proportions we’ll use for our examples, let’s start pulling them and adding them to our data frame. In this step, it’s helpful to imagine the original data as a deck of ~43,000 cards. Each card in the deck has 75 “scores”, each ranging from 0/7 (no FREX words) to 7/7 (all FREX words) for each topic. For each topic, we want to draw ten cards or examples.

Those ten cards should be organized from highest score to lowest score, so we’ll start by creating as many as eight smaller decks from our large deck of cards: one for the 7/7 cards, one for the 6/7 cards, one for the 5/7 cards, and so on. If we start making these smaller decks by “dealing” them as one deals cards from a deck, then we’ll have a new problem. The big deck is currently ordered in time so that the earliest tweets appear at the top of the deck and the latest tweets are at the bottom. When we start drawing our ten examples from these decks, they’ll retain that ordering. We don’t want to base our interpretation of the topics on examples which appear earlier or later. So we need to do one more thing to those smaller decks: shuffle them!

Once the smaller decks are shuffled, we’ll draw our ten examples. We’ll start with the highest proportion deck and work our way down the decks until we have ten. Imagine that we’re working on the first topic and we have four decks: 2 tweets contain 3/7 FREX words, 6 tweets contain 2/7 FREX words, 137 contain 1/7 FREX words, and the remaining ~43k contain 0/7 FREX words. We’ll shuffle these decks among themselves. Then we need to draw ten in total, so we’ll first draw the two tweets from the first deck. Then, we’ll move to the second deck and draw all 6 tweets. Finally, we’ll move to the third deck and draw 2 tweets to give us 10 total.

Along the way, we’ll need some additional measures to keep track of how this process is working for each of our topics. maxlevel will record the highest “deck”, or the highest proportion of FREX words, that we start drawing our examples from. minlevel will record the lowest “deck” that we stop drawing from once we have ten examples. And depth will keep track of how many “decks” we have to draw from to reach ten. We want to make sure that we never draw from the deck with 0/7 FREX words, since those tweets will tell us nothing about what the topic means. Better examples will be drawn for topics with a higher maxlevel and a lower depth. It should also be the case that every non-English tweet appears in the 0/7 deck for every topic, since every FREX word is an English word.

We’ll start with those measures, since they’re the easiest to run and can be used to diagnose potential problems with this process. For maxlevel and minlevel, we’ll order the mean vectors we created in the last step from highest to lowest, take the first ten scores, then return the highest and the lowest multiplied by 7 (to change from proportion to count of FREX words). Then for the depth, we’ll just subtract the minlevel from the maxlevel and add 1.

maxlevel<-rep(NA, 75)
minlevel<-rep(NA, 75)

for(i in 1:75){
  maxlevel[i] <- max(head(get(paste("Top", i, "_FrexM", sep=""))[order(get(paste("Top", i, "_FrexM", sep="")), decreasing=T)], 10)*7)
  minlevel[i] <- min(head(get(paste("Top", i, "_FrexM", sep=""))[order(get(paste("Top", i, "_FrexM", sep="")), decreasing=T)], 10)*7)
}

depth<-(maxlevel-minlevel)+1

Now let’s look at those results:

table(maxlevel)
## maxlevel
##  2  3  4 
## 25 45  5
table(minlevel)
## minlevel
##  1  2  3 
##  3 59 13
table(depth)
## depth
##  1  2 
## 30 45

Here, we can see that the minimum level is never 0 – so we’ll never draw examples which don’t contain any of the FREX words in a topic. The depth is also only ever 1 or 2, meaning that the quality of the examples we pull should be pretty uniform. This looks great, so let’s move forward with pulling examples this way.

In the following code, we’ll cycle through each of the 75 topics. When the depth is 1, meaning we can pull all ten examples just by drawing from the highest proportion “deck”, the process is relatively straightforward: first, create a vector templist of the indices of the mean list which equal the maximum. Then set a seed and sample 10 indices without replacement. Finally, write the content to the example variables in our frex data frame.

When the depth is 2, the process is a little more complicated. First, we need a list of the indices of the mean lists which equal the maximum, like before. This list should be shorter than 10. When that list contains only one index, we can simply write that index to the start of a new list that we’ll add to called randlist. When the length of indices is greater than 1, we’ll set a seed and shuffle them without replacement. Next, we need to count how many we need to draw from the second deck, so toAdd stores 10 minus the number of tweets we pulled from the highest proportion group. Then we’ll create a second list that contains the indices of the mean list which equal the second highest proportion of FREX words. When that second list contains just one index, we’ll add that to the end of randlist. If the second list contains more than one index, we’ll sample toAdd number of indices from the second list and add that to the end of randlist. Finally, we’ll write the indexed content to our example variables in the frex data frame.

for(i in 1:dim(frex)[1]){
  #Depth 1
  if(depth[i]==1){
    templist<-(1:dim(data)[1])[get(paste("Top", i, "_FrexM", sep=""))==max(get(paste("Top", i, "_FrexM", sep="")))]
      set.seed(1280)
      randlist<-sample(x=templist, size=10)
      frex$ex_1[i]<-data$Content_wo_Path[randlist[1]]
      frex$ex_2[i]<-data$Content_wo_Path[randlist[2]]
      frex$ex_3[i]<-data$Content_wo_Path[randlist[3]]
      frex$ex_4[i]<-data$Content_wo_Path[randlist[4]]
      frex$ex_5[i]<-data$Content_wo_Path[randlist[5]]
      frex$ex_6[i]<-data$Content_wo_Path[randlist[6]]
      frex$ex_7[i]<-data$Content_wo_Path[randlist[7]]
      frex$ex_8[i]<-data$Content_wo_Path[randlist[8]]
      frex$ex_9[i]<-data$Content_wo_Path[randlist[9]]
      frex$ex_10[i]<-data$Content_wo_Path[randlist[10]]
      rm(list=c("templist", "randlist"))
  }
  #Depth 2
  if(depth[i]==2){
    #First "deck"
    templist1<-(1:dim(data)[1])[get(paste("Top", i, "_FrexM", sep=""))==max(get(paste("Top", i, "_FrexM", sep="")))]
    if(length(templist1)==1){
      randlist<-templist1
    }
    if(length(templist1)>1){
      set.seed(1280)
      randlist<-sample(x=templist1)
    }
    toAdd<-10-length(templist1)
    #Second "deck"
    templist2<-(1:dim(data)[1])[get(paste("Top", i, "_FrexM", sep=""))==unique(get(paste("Top", i, "_FrexM", sep="")))[order(unique(get(paste("Top", i, "_FrexM", sep=""))), decreasing=T)][2]]
    if(length(templist2)==1){
      randlist<-c(randlist, templist2)
    }
    if(length(templist2)>1){
      set.seed(1280)
        randlist<-c(randlist, sample(x=templist2, size=toAdd))
    }
      #Write to data frame
      frex$ex_1[i]<-data$Content_wo_Path[randlist[1]]
      frex$ex_2[i]<-data$Content_wo_Path[randlist[2]]
      frex$ex_3[i]<-data$Content_wo_Path[randlist[3]]
      frex$ex_4[i]<-data$Content_wo_Path[randlist[4]]
      frex$ex_5[i]<-data$Content_wo_Path[randlist[5]]
      frex$ex_6[i]<-data$Content_wo_Path[randlist[6]]
      frex$ex_7[i]<-data$Content_wo_Path[randlist[7]]
      frex$ex_8[i]<-data$Content_wo_Path[randlist[8]]
      frex$ex_9[i]<-data$Content_wo_Path[randlist[9]]
      frex$ex_10[i]<-data$Content_wo_Path[randlist[10]]
      rm(list=c("templist1", "templist2", "toAdd", "randlist"))
  }
}

Note: the extra step in the second “deck” portion of the code, where we check whether the lists contain only 1 index, is ABSOLUTELY necessary! If we just called sample() and the list contained only a single index n, the function would return a sample of the integers between 1 and n, rather than just the index in the list.
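A quick, self-contained demonstration of this quirk of sample():

```r
#When x is a single positive number n, sample(x) permutes 1:n --
#it does NOT return the one-element vector itself.
set.seed(1280)
sample(c(17))         #a permutation of 1:17
length(sample(c(17))) #17, not 1
#A vector of length > 1 is shuffled as expected:
sample(c(17, 42))     #17 and 42 in random order
```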

The last thing we need to do is append the depth, maxlevel, and minlevel to the data frame and save it as an Excel file. The authors then interpreted the FREX words and examples manually to determine what each topic concerns.

frex<-cbind(frex, depth, maxlevel, minlevel)
write.xlsx(frex, "stm_75_wExamples.xlsx") #write.xlsx is provided by the openxlsx (or xlsx) package

Network analysis

Once we have the STM results, we can show relationships between topics using some basic network analyses. To start, we’ll load in the required packages for this section:

library(igraph)
library(colorspace)
library(showtext)

In the plotting code below, we used the Helvetica font to draw some of the text. This is not a necessary step, but it can make the base plotting used for networks look a bit better. The code below should work on most Macintosh systems; your mileage may vary on other operating systems.

setwd("/System/Library/Fonts/")
font_add("helvet", regular = "Helvetica.ttc")

Next, let’s load in the STM results that we saved earlier, and we’ll extract just the theta matrix – or the document by topic matrix filled with fit values.

stmresults<-readRDS("stm.rds")
stmresults<-stmresults$theta

To find relationships between STM topics, we’ll use the correlation function from base R. Effectively, this will show how topics tend to fit higher to a certain group of documents in the Twitter data.

cormat<-cor(stmresults)
diag(cormat)<-NA

Next, we decided to collapse STM topics into the thirteen researcher-identified categories. To accomplish this, we’ll simply add the columns which correspond to the categories. Those topic numbers are identified in the comments.

stmresults2<-matrix(nrow=dim(stmresults)[1], ncol=13)

#aesthetics <- 39, 48
#cost <- 30
#expression <- 2
#integration <- 3, 63, 73
#lifelogging <- 18
#network <- 13, 44, 56
#privacy <- 42, 72
#simplicity <- 46

stmresults2[,1]<-stmresults[,39]+stmresults[,48]
stmresults2[,2]<-stmresults[,30]
stmresults2[,3]<-stmresults[,2]
stmresults2[,4]<-stmresults[,3]+stmresults[,63]+stmresults[,73]
stmresults2[,5]<-stmresults[,18]
stmresults2[,6]<-stmresults[,13]+stmresults[,44]+stmresults[,56]
stmresults2[,7]<-stmresults[,42]+stmresults[,72]
stmresults2[,8]<-stmresults[,46]

#bugs <- 6, 8, 9, 14, 17, 20, 22, 41, 55, 65
#downfall <- 7, 25, 38, 51, 54, 59, 71, 74
#rise <- 1, 12, 15, 21, 23, 24, 26, 28, 31, 33, 35, 45, 49, 58, 64, 67, 70, 75
#uncertainty <- 4, 52, 53
#noise <- 5, 10, 11, 16, 19, 27, 29, 32, 34, 36, 37, 40, 43, 47, 50, 57, 60, 61, 62, 66, 68, 69

stmresults2[,9]<-stmresults[,6]+stmresults[,8]+stmresults[,9]+stmresults[,14]+stmresults[,17]+stmresults[,20]+stmresults[,22]+stmresults[,41]+stmresults[,55]+stmresults[,65]
stmresults2[,10]<-stmresults[,7]+stmresults[,25]+stmresults[,38]+stmresults[,51]+stmresults[,54]+stmresults[,59]+stmresults[,71]+stmresults[,74]
stmresults2[,11]<-stmresults[,1]+stmresults[,12]+stmresults[,15]+stmresults[,21]+stmresults[,23]+stmresults[,24]+stmresults[,26]+stmresults[,28]+stmresults[,31]+stmresults[,33]+stmresults[,35]+stmresults[,45]+stmresults[,49]+stmresults[,58]+stmresults[,64]+stmresults[,67]+stmresults[,70]+stmresults[,75]
stmresults2[,12]<-stmresults[,4]+stmresults[,52]+stmresults[,53]
stmresults2[,13]<-stmresults[,5]+stmresults[,10]+stmresults[,11]+stmresults[,16]+stmresults[,19]+stmresults[,27]+stmresults[,29]+stmresults[,32]+stmresults[,34]+stmresults[,36]+stmresults[,37]+stmresults[,40]+stmresults[,43]+stmresults[,47]+stmresults[,50]+stmresults[,57]+stmresults[,60]+stmresults[,61]+stmresults[,62]+stmresults[,66]+stmresults[,68]+stmresults[,69]

Then, just like before, we can find the relationships between topic categories by taking the correlation of the matrix.

cormat2<-cor(stmresults2)
diag(cormat2)<-NA

Then, to make the relationships work as weights in a network, we took the absolute value of the correlation matrices, rounded the results to 6 digits, and multiplied them by 1,000,000 so that all of the values are integers.

cormat_p<-round(abs(cormat), 6)*1000000
cormat2_p<-round(abs(cormat2), 6)*1000000

Now that the STM results are cleaned up, we can transform these matrices into igraph network objects using the graph_from_adjacency_matrix function.

net1<-graph_from_adjacency_matrix(cormat_p, mode="undirected", weighted=T, diag=F)
net2<-graph_from_adjacency_matrix(cormat2_p, mode="undirected", weighted=T, diag=F)

We’ll also need some vertex labels to identify topic categories:

V(net2)$name<-c("Aesthetics", "Cost", "Expression", "Integration", "Lifelogging", "Network", "Privacy", "Simplicity", "Complaints", "Downfall", "Rise", "Uncertainty", "Noise")

From here on out, we’ll use the aggregated network, as that’s what was reported in the paper. However, you could easily apply any of the analysis or graphing functions on the full topic-to-topic network.

First, let’s look at the weighted edge values; to report the correlation rather than the transformed weights, we’ll go back to the aggregated correlation matrix:

#Maximum - rise & complaints
max(abs(cormat2), na.rm=T)
## [1] 0.4026363
#Minimum - cost & network
min(abs(cormat2), na.rm=T)
## [1] 0.001202756

In the paper, we then examined the eigenvector centrality scores of each topic category. This is simply calculated using the eigen_centrality function from igraph.

eigen_centrality(net2)$vector

Here are the results:

Category Eigenvector Centrality
Rise 1
Complaints 0.971001438605204
Integration 0.858138685083365
Noise 0.840570549865823
Downfall 0.772115024156617
Uncertainty 0.675869143682172
Aesthetics 0.674273578339512
Privacy 0.550893015653835
Network 0.487314442048549
Lifelogging 0.338440740994626
Expression 0.330839138975101
Simplicity 0.265308436436015
Cost 0.232167580529719

Now let’s turn to plotting the network. For the first network graph used in the paper, we used a color scheme that differentiates the topic categories by whether they are theoretical, inductively identified, or noise. Within each of those broader categories, we then differentiate topic categories by shade. To accomplish this, we’ll use the colorRampPalette function from the grDevices package.

collist<-rep(NA, 13)
v.pal<-colorRampPalette(c(rgb(40/255, 0, 128/255), rgb(227/255, 214/255, 255/255)))
collist[1:8]<-v.pal(8)
v.pal<-colorRampPalette(c(rgb(0, 82/255, 22/255), rgb(179/255, 245/255, 196/255)))
collist[9:12]<-v.pal(4)
collist[13]<-"orange3"

Next, let’s assign the color list to a vertex attribute in the network object called "col". That way, we can simply call on that network object in the plotting function and the colors will be indexed appropriately.

V(net2)$col<-collist[1:13]

We then wanted to create a color scheme for the edges which reflects the strength of the correlation between topic categories. We can do this in one of two ways:

  1. We could continue to use the absolute value of the correlation matrix to show all relationships as positive (used in the paper)
  2. We could show the negative correlations as a different color (red)

The code below creates edge colors ramped between transparent and black using the absolute values. Uncomment the last three lines to add in the red color for negative correlations.

col1<-rgb(1, 1, 1, 0)
col2<-"black"
col3<-"darkred"
ep.pal<-colorRampPalette(c(col1, col2), alpha=T)
ep.col<-ep.pal(max(E(net2)$weight))
en.pal<-colorRampPalette(c(col1, col3), alpha=T)
en.col<-en.pal(max(E(net2)$weight))

E(net2)$col<-ep.col[c(E(net2)$weight)]
#E(net2)$pos<-0
#E(net2)$pos[cormat2[lower.tri(cormat2, diag = FALSE)]>0]<-1
#E(net2)$col[E(net2)$pos==0]<-en.col[c(E(net2)$weight[E(net2)$pos==0])]

Next, we need a layout for the nodes. Since we used a correlation matrix to create the network, it’s fully connected. A layout created from a sparsely-connected network is more informative, so we first remove all of the edges at or below the median weight. Then, we can use the layout_with_fr function to generate the Fruchterman-Reingold layout. Because this is a non-deterministic process, we set a seed first.

net2b<-delete_edges(net2, E(net2)[E(net2)$weight<=quantile(E(net2)$weight, probs=.5)])
set.seed(748219)
layout2.2<-layout_with_fr(net2b, niter=10000)
layout2.2<-norm_coords(layout2.2)

Great, now that we have all of the components, let’s plot the network:

The last network analysis we conducted was community detection, specifically using the spinglass method. Because this is non-deterministic, we again set a seed. Then, using the community membership information, we can add a new vertex attribute which we’ll use to plot the colors of the communities.

set.seed(246810)
cl<-cluster_spinglass(net2)
V(net2)$fcol<-NA
V(net2)$fcol[cl$membership==1]<-"red3"
V(net2)$fcol[cl$membership==2]<-"springgreen2"
V(net2)$fcol[cl$membership==3]<-"deepskyblue4"

And here’s what that looks like plotted:

If you want to be able to recreate these graphs for yourself, you can use the following code:

pdf("AggNet_1.pdf", 10, 8)
showtext_begin()
par(mar=c(0, 2, 0, 11.25))
plot(net2, edge.color=E(net2)$col, edge.width=(E(net2)$weight/80000), vertex.color=V(net2)$col, vertex.size=15, vertex.label.font=2, vertex.label.cex=1.2, vertex.label.color="black", vertex.label.degree=c((pi*8)/7, (pi*3)/2, pi/12, pi/12, (pi*3)/2, (pi*5)/4, pi/2, (pi*5)/4, pi/2, (pi*19)/32, (pi*9)/5, (pi*2)/7, (pi*25)/24), vertex.label.dist=c(2.75, 1.5, 3, 3, 1.5, 2.2, 1.5, 2.2, 2, 1.6, 1.9, 2, 2.2), vertex.label.family='helvet', layout=layout2.2, rescale=F, xlim=c(-1.05, 1.05), ylim=c(-1.05, 1.05))
rect(1.25, -0.7, 1.75, 0.7, lwd=2)
text(mean(c(1.25, 1.75)), 0.625, "Topic", font=2, cex=1.5)
points(rep(1.4, 13), seq(0.525, -0.6, length.out=15)[c(1:8, 10:13, 15)], col=collist[1:13], pch=19, cex=2)
text(rep(1.4, 13), seq(0.525, -0.6, length.out=15)[c(1:8, 10:13, 15)]-.0075, c("Aesthetics", "Cost", "Expression", "Integration", "Lifelogging", "Network", "Privacy", "Simplicity", "Complaints", "Downfall", "Rise", "Uncertainty", "Noise"), pos=4, cex=0.9)
text(1.325, mean(seq(0.525, -0.6, length.out=15)[c(1, 8)]), "Theoretical", font=3, cex=1.2, srt=90)
text(1.325, mean(seq(0.525, -0.6, length.out=15)[c(10, 13)]), "Additional", font=3, cex=1.2, srt=90)
text(-1.175, 1.125, "A", font=2, cex=3, xpd=T)
showtext_end()
dev.off()

pdf("AggNet_2.pdf", 10, 8)
showtext_begin()
par(mar=c(0, 2, 0, 11.25))
plot(cl, net2, col=V(net2)$fcol, mark.border=NA, mark.col=c(adjustcolor("red3", alpha.f=0.5), adjustcolor("springgreen2", alpha.f=0.5), adjustcolor("deepskyblue4", alpha.f=0.5)), edge.color=E(net2)$col, edge.width=(E(net2)$weight/80000), vertex.size=15, vertex.label.font=2, vertex.label.cex=1.2, vertex.label.color="black", vertex.label.degree=c((pi*8)/7, (pi*3)/2, pi/12, pi/12, (pi*3)/2, (pi*5)/4, pi/2, (pi*5)/4, pi/2, (pi*19)/32, (pi*9)/5, (pi*2)/7, (pi*25)/24), vertex.label.dist=c(2.75, 1.5, 3, 3, 1.5, 2.2, 1.5, 2.2, 2, 1.6, 1.9, 2, 2.2), vertex.label.family='helvet', layout=layout2.2, rescale=F, xlim=c(-1.05, 1.05), ylim=c(-1.05, 1.05))
rect(1.25, -0.7, 1.75, 0.7, lwd=2)
text(mean(c(1.25, 1.75)), 0.625, "Topic", font=2, cex=1.5)
points(rep(1.4, 13), seq(0.525, -0.6, length.out=15)[c(1:8, 10:13, 15)], col=V(net2)$fcol, pch=19, cex=2)
text(rep(1.4, 13), seq(0.525, -0.6, length.out=15)[c(1:8, 10:13, 15)]-.0075, c("Aesthetics", "Cost", "Expression", "Integration", "Lifelogging", "Network", "Privacy", "Simplicity", "Complaints", "Downfall", "Rise", "Uncertainty", "Noise"), pos=4, cex=0.9)
text(1.325, mean(seq(0.525, -0.6, length.out=15)[c(1, 8)]), "Theoretical", font=3, cex=1.2, srt=90)
text(1.325, mean(seq(0.525, -0.6, length.out=15)[c(10, 13)]), "Additional", font=3, cex=1.2, srt=90)
text(-1.175, 1.125, "B", font=2, cex=3, xpd=T)
showtext_end()
dev.off()

Web scraping the WayBack archive

To compare the Twitter results against the full set of Path’s App Store features, we’ll need to first scrape the Wayback machine for all of Path’s App Store webpages. There are a number of packages which will allow us to put this data set together:

library(rjson)
library(rvest)
library(stringr)
library(ngram)
library(parallel)

Note: the parallel package is not necessary, but it helps speed up the process. The parallel code presented here will work on Mac and Linux machines. On Windows, replace each mclapply call with lapply and drop the parallel arguments (e.g., “mc.cores=detectCores()-1”).
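Because mclapply is a drop-in parallel version of lapply, the serial fallback only requires swapping the function name. A minimal sketch (the toy function sq is just for illustration):

```r
library(parallel)
sq<-function(i){i^2}
serial<-lapply(1:10, sq)
#Forks worker processes on Mac/Linux; on Windows, use lapply()
#or set mc.cores=1
forked<-mclapply(1:10, sq, mc.cores=2)
identical(serial, forked) #results and ordering match
```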

We want to create a query for each day between November 15, 2010 (when Path went live on the App Store) and December 31, 2017 (the end of our Twitter collection). We’ll start by listing out each day in R’s Date format:

dates<-as.character(as.Date(as.Date("2010-11-15"):as.Date("2017-12-31"), origin="1970-01-01"))
dates2<-gsub("-", "", dates)

Next, I wrote a function that takes each day in the list we created, appends the character string to the URL used by the WayBack archive for Path’s App Store page, and performs a JSON query using WayBack’s API and the rjson package. The WayBack API returns an object called “closest URL” – this tells us the most recent snapshot relative to the date provided by the query. By returning the unique results, we can get a URL list of all of the unique snapshots contained in the WayBack archive.

jsonfn<-function(i){
  file<-fromJSON(file=paste("https://archive.org/wayback/available?url=http://itunes.apple.com/us/app/path/id403639508?mt=8&timestamp=", dates2[i], sep=""))
  return(file$archived_snapshots$closest$url)
}
waybacks<-unlist(lapply(1:length(dates2), jsonfn))
waybacks<-unique(waybacks)

Here’s an example of the output provided:

## [1] "http://web.archive.org/web/20101115230956/http://itunes.apple.com/us/app/path/id403639508?mt=8"   
## [2] "http://web.archive.org/web/20101117042833/http://itunes.apple.com/us/app/path/id403639508?mt=8"   
## [3] "http://web.archive.org/web/20101118023115/http://itunes.apple.com:80/us/app/path/id403639508?mt=8"
## [4] "http://web.archive.org/web/20101118044211/http://itunes.apple.com:80/us/app/path/id403639508?mt=8"
## [5] "http://web.archive.org/web/20101123031002/http://itunes.apple.com/us/app/path/id403639508?mt=8"

Next, the date and time of the snapshots are embedded in the URL. These functions pull them out using substring functions from the stringr package:

datefn<-function(i){
  year<-str_sub(waybacks[i], 28, 31)
  month<-str_sub(waybacks[i], 32, 33)
  day<-str_sub(waybacks[i], 34, 35)
  date<-paste(year, "-", month, "-", day, sep="")
  return(date)
}

dates<-unlist(mclapply(1:length(waybacks), datefn, mc.cores=detectCores()-1))

timefn<-function(i){
  hours<-str_sub(waybacks[i], 36, 37)
  minutes<-str_sub(waybacks[i], 38, 39)
  seconds<-str_sub(waybacks[i], 40, 41)
  time<-paste(hours, ":", minutes, ":", seconds, sep="")
  return(time)
}

times<-unlist(mclapply(1:length(waybacks), timefn, mc.cores=detectCores()-1))

Now, put these together in a data frame called wayback:

wayback<-as.data.frame(cbind(dates, times, waybacks))
names(wayback)<-c("Date", "Time", "URL")
wayback$URL<-as.character(wayback$URL)

Now that we have the URL for each snapshot, we need to pull the description text. This function uses the rvest package to:

  • read in the HTML file
  • drill down to the node called “.product-review”, where the description is stored
  • convert that node to an R character string
  • if the result contains more than one string, paste the pieces together
descfn<-function(i){
  webpage<-read_html(wayback$URL[i])
  prodrev<-html_nodes(webpage,'.product-review')
  outs<-html_text(prodrev)
  if(length(outs)>1){
    temp<-outs[1]
    for(j in 2:length(outs)){
      temp<-paste(temp, outs[j])
    }
    outs<-temp
  }
  return(outs)
}

wayback$Description<-lapply(1:dim(wayback)[1], descfn)

Here’s an example of the result:

## [[1]]
## [1] "\n  \n    \n      \n      Description\n    \n    \n  \n  \n    Path is the personal social network that is the best way to share life and stay connected with family and friends. By focusing on beautiful design and experience, Path is 5-star rated and loved by millions of people.Be closer to those you care about and remember every important moment: photos, videos, places, music, movies, workouts, when you wake up and go to sleep, and more.And for those of you who want one place to share to all of your favorite networks like Facebook, Twitter, Tumblr and Foursquare, you can easily check in, upload photos and videos, blog, and tweet directly from Path, all-in-one.Features:★ Capture photos and videos with custom-designed filters and editing features.★ Journal your thoughts, when you're asleep or awake, and check in to places with friends.★ Share the music you're listening to, movies you're watching, and books you're reading.★ Interact with your friends' posts: Smile, Laugh, Gasp, Frown at or Love their moments.★ All-in-one posting to Facebook, Twitter, Tumblr, and Foursquare.★ New! Search all your moments.★ Import your photos, statuses, and check-ins from Instagram, Facebook, and Foursquare, and use Search to tell great stories.★ Share your daily runs, walks, or hikes with Nike+ Running and daily activity with Nike+ FuelBand.★ Available on iPhone, iPad, iPad mini, and Android.★ iPad only: Landscape view allows you to browse the most interesting moments, day by day.\n  \n  \n \n  \n    \n      \n      What's New in Version 2.9.1\n    \n    \n  \n  \n    ★ Search - rediscover memories on Path by season, moment-type, holidays, weather, dates, people, places, nearby, and more! ✓ Performance improvements. ☂ Bug fixes.\n  \n  \n"
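As an aside, the if/for concatenation inside descfn is equivalent (for non-empty results) to a single paste with collapse. A sketch, where descfn2 is a hypothetical drop-in replacement:

```r
library(rvest)

descfn2 <- function(i){
  webpage <- read_html(wayback$URL[i])
  # collapse joins however many strings html_text() returns into one
  paste(html_text(html_nodes(webpage, '.product-review')), collapse = " ")
}
```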

Now that we have the description texts, we next need to parse out the important parts of each paragraph. In the example above, notice how the feature list appears before the version number. This makes a useful split point, and we can extract the version number in the process. The code below starts by assigning version “1.0” to all observations whose description does not contain the phrase “What’s New in Version”. For the other observations, we split the text on this phrase: the first half is retained as the description, and the version number is extracted from the second half using a substring function. Once this step is complete, we drop d1 and d2 – the split halves – as they’re no longer needed.

wayback$Version<-NA
wayback$Version[!grepl("What's New in Version", wayback$Description)]<-"1.0"
wayback$d1<-NA
wayback$d2<-NA

splitFn1<-function(i){
  if(!grepl("What's New in Version", wayback$Description[i])){
    return(wayback$Description[i])
  }
  if(grepl("What's New in Version", wayback$Description[i])){
    return(unlist(str_split(unlist(wayback$Description[i]), "What's New in Version"))[1])
  }
}

splitFn2<-function(i){
  if(!grepl("What's New in Version", wayback$Description[i])){
    return(NA)
  }
  if(grepl("What's New in Version", wayback$Description[i])){
    return(unlist(str_split(unlist(wayback$Description[i]), "What's New in Version"))[2])
  }
}

wayback$d1<-unlist(mclapply(1:dim(wayback)[1], splitFn1, mc.cores=detectCores()-1))
wayback$d2<-unlist(mclapply(1:dim(wayback)[1], splitFn2, mc.cores=detectCores()-1))

wayback$Version[grepl("What's New in Version", wayback$Description)]<-substr(wayback$d2, 0, 8)[grepl("What's New in Version", wayback$Description)]

wayback$Version<-gsub("\n", " ", wayback$Version)
wayback$Version<-str_trim(wayback$Version)

wayback$Description<-wayback$d1
wayback<-wayback[1:5]

The next step involves cleaning up a lot of the extraneous text formatting. Using the gsub command from base R, we’ll replace the “Description” heading and linebreaks with spaces. Then we’ll remove extra spaces from the ends and between each word using the str_trim and str_squish functions (respectively) found in the stringr package.

wayback$Description<-gsub("Description\n", " ", wayback$Description)
wayback$Description<-gsub("\n", " ", wayback$Description)
wayback$Description<-str_trim(wayback$Description)
wayback$Description<-str_squish(wayback$Description)

Now that the texts are prepared, the next step is to extract the individual features from their lists. Unfortunately, Path was not consistent in how they formatted their feature lists over the ~7 years of data we collected, so in order to split the text we need to code a rule for each observation that captures the format in use at the time. The solution below is inelegant, but it works for our data set. Your mileage will, of course, vary.

wayback$SplitRule<-NA
#Split on "•"
  wayback$SplitRule[grepl("•", wayback$Description)]<-1
#Split on "★"
  wayback$SplitRule[grepl("★", wayback$Description) & !grepl("★ ★", wayback$Description)]<-2
#Split on "☆"
  wayback$SplitRule[grepl("☆", wayback$Description)]<-3
#Split first on "Features:", then on "- "
  wayback$SplitRule[grepl("Features:", wayback$Description) & is.na(wayback$SplitRule)]<-"4a"
#Just split on "- "
  wayback$SplitRule[is.na(wayback$SplitRule)][grepl("On Path, you can:", wayback$Description[is.na(wayback$SplitRule)])]<-"4b"
#Split on "[.]"
  wayback$SplitRule[is.na(wayback$SplitRule)][grepl(".*:", wayback$Description[is.na(wayback$SplitRule)])]<-"5a"               
#Split on "[.]", with "Features" at the start
  wayback$SplitRule[wayback$SplitRule=="5a" & grepl("Features", wayback$Description)]<-"5b"

Next, let’s make some empty feature variables to fill. Through some experimentation (starting with many more than 14 and trimming back the code later), I found that the maximum number of features contained in a single observation is 14.

#Create empty variables Feature_1 through Feature_14
wayback[paste0("Feature_", 1:14)]<-NA

Great! Now comes the hard part: splitting the character string according to each of the seven rules described above. First, we cycle through each observation using a for loop and determine which of the seven rules to apply. If the rule states that “Features” precedes the list, we first split on that word. Then we split on the delimiter shown above, trimming extra spaces from the ends as we go. The result of this split is a vector, so we can iterate through its length and use that index (j) to determine which variable in the data frame to write to. Some additional formatting is applied depending on the rule. The only exception to this general format is rule 4a – in some cases a hyphen, the character used for splitting, interrupts the middle of a feature. Luckily, when this occurs the split leaves the two parts of the feature with fewer than 5 words each, while all intact features contain more than 5 words. We used the wordcount function from the ngram package to detect these cases and paste the two halves back together as intended. This is another inelegant solution, but one that works for the data we have.

for(i in 1:dim(wayback)[1]){
  #Rule 1
  if(!is.na(wayback$SplitRule[i]) & wayback$SplitRule[i]==1){
    templist<-trimws(unlist(str_split(wayback$Description[i], "•")))
    templist<-templist[2:length(templist)]
    for(j in 1:length(templist)){
      wayback[[paste("Feature_", j, sep="")]][i]<-templist[j]
    }
    rm(templist)
  }
  #Rule 2
  if(!is.na(wayback$SplitRule[i]) & wayback$SplitRule[i]==2){
    templist<-trimws(unlist(str_split(wayback$Description[i], "★")))
    templist<-templist[2:length(templist)]
    for(j in 1:length(templist)){
      wayback[[paste("Feature_", j, sep="")]][i]<-templist[j]
    }
    rm(templist)
  }
  #Rule 3
  if(!is.na(wayback$SplitRule[i]) & wayback$SplitRule[i]==3){
    templist<-trimws(unlist(str_split(wayback$Description[i], "☆")))
    templist<-templist[2:length(templist)]
    for(j in 1:length(templist)){
      wayback[[paste("Feature_", j, sep="")]][i]<-templist[j]
    }
    rm(templist)
  }
  #Rule 4a
  if(!is.na(wayback$SplitRule[i]) & wayback$SplitRule[i]=="4a"){
    temptxt<-unlist(str_split(wayback$Description[i], "Features:"))[2]
    templist<-trimws(unlist(str_split(temptxt, "- ")))
    templist<-templist[templist!=""]
    wordcounts<-rep(NA, length(templist))
    for(j in 1:length(wordcounts)){
      wordcounts[j]<-wordcount(templist[j])
    }
    for(j in 1:length(templist)){
      if(wordcounts[j]<=5){
        templist[j+1]<-paste(templist[j], templist[j+1], sep=" - ")
      }
    }
    templist<-templist[wordcounts>5]
    for(j in 1:length(templist)){
      wayback[[paste("Feature_", j, sep="")]][i]<-templist[j]
    }
    rm(temptxt)
    rm(templist)
  }
  #Rule 4b
  if(!is.na(wayback$SplitRule[i]) & wayback$SplitRule[i]=="4b"){
    templist<-trimws(unlist(str_split(wayback$Description[i], "- ")))
    templist<-templist[2:length(templist)]
    for(j in 1:length(templist)){
      wayback[[paste("Feature_", j, sep="")]][i]<-templist[j]
    }
    rm(templist)
  }
  #Rule 5a
  if(!is.na(wayback$SplitRule[i]) & wayback$SplitRule[i]=="5a"){
    templist<-trimws(unlist(str_split(wayback$Description[i], "[.]")))
    templist<-templist[templist!=""]
    templist<-templist[2:length(templist)]
    for(j in 1:length(templist)){
      if(!grepl(":", templist[j])){
        templist[j-1]<-paste(templist[j-1], templist[j], sep=". ")
      }
    }
    templist<-templist[grepl(":", templist)]
    for(j in 1:length(templist)){
      wayback[[paste("Feature_", j, sep="")]][i]<-templist[j]
    }
    rm(templist)
  }
  #Rule 5b
  if(!is.na(wayback$SplitRule[i]) & wayback$SplitRule[i]=="5b"){
    temptxt<-unlist(str_split(wayback$Description[i], "Features"))[2]
    templist<-trimws(unlist(str_split(temptxt, "[.]")))
    templist<-templist[templist!=""]
    for(j in 1:length(templist)){
      if(!grepl(":", templist[j])){
        templist[j-1]<-paste(templist[j-1], templist[j], sep=". ")
      }
    }
    templist<-templist[grepl(":", templist)]
    for(j in 1:length(templist)){
      wayback[[paste("Feature_", j, sep="")]][i]<-templist[j]
    }
    rm(temptxt)
    rm(templist)
  }
}

Once that step is complete, we need only clean up the “features” we don’t need for our analyses and count the number of features in each observation. First, we remove links to press testimonials. Then, we remove device-compatibility text. Finally, some lists ended with the sentiment “enjoy”, which is not relevant to the function of the application.

wayback[!is.na(wayback) & wayback=="Path in The New York Times: http://nyti.ms/cVs14J"]<-NA
wayback[!is.na(wayback) & wayback=="Path in The Los Angeles Times: http://lat.ms/bszNxl"]<-NA
wayback[!is.na(wayback) & wayback=="Path in Wired: http://bit.ly/9mQePM"]<-NA
wayback[!is.na(wayback) & wayback=="Path in Forbes: http://bit.ly/cC2x62"]<-NA

wayback$Feature_5<-gsub("Compatible with any iPhone or iPod Touch running iOS 4.0 or above.Enjoy Path on iPhone or on the Web at www.path.com.Learn More:", "", wayback$Feature_5)
wayback$Feature_5<-gsub("Compatible with any iPhone or iPod Touch running iOS 4.0 or above.Enjoy Path on iPhone or on the Web at www.path.com.Path Elsewhere:", "", wayback$Feature_5)
wayback$Feature_5<-gsub("Compatible with any iPhone or iPod Touch running iOS 4.0 or above.Enjoy Path on iPhone or on the Web at www.path.com.Path in The New York Times: http://nyti.ms/cVs14JPath in The Los Angeles Times: http://lat.ms/bszNxlPath in Wired: http://bit.ly/9mQePMPath in Forbes: http://bit.ly/cC2x62", "", wayback$Feature_5)
wayback$Feature_5<-gsub("Compatible with any iPhone running iOS 4.0 or above.Enjoy Path on iPhone or on the Web at www.path.com.", "", wayback$Feature_5)
wayback$Feature_7<-gsub("Compatible with any iPhone or iPod Touch running iOS 4.0 or above.Enjoy Path on iPhone or on the Web at www.path.com.Learn More:", "", wayback$Feature_7)

wayback$Feature_8<-gsub(".Enjoy.", ".", wayback$Feature_8)
wayback$Feature_10<-gsub(".Enjoy.", ".", wayback$Feature_10)

Lastly, we need to count the number of features in each observation. Since the feature variables fill in order and every variable past the last feature is NA, we can simply let each successive assignment overwrite the count whenever a higher-numbered feature is present:

wayback$nFeatures<-0
wayback$nFeatures[!is.na(wayback$Feature_1)]<-1
wayback$nFeatures[!is.na(wayback$Feature_2)]<-2
wayback$nFeatures[!is.na(wayback$Feature_3)]<-3
wayback$nFeatures[!is.na(wayback$Feature_4)]<-4
wayback$nFeatures[!is.na(wayback$Feature_5)]<-5
wayback$nFeatures[!is.na(wayback$Feature_6)]<-6
wayback$nFeatures[!is.na(wayback$Feature_7)]<-7
wayback$nFeatures[!is.na(wayback$Feature_8)]<-8
wayback$nFeatures[!is.na(wayback$Feature_9)]<-9
wayback$nFeatures[!is.na(wayback$Feature_10)]<-10
wayback$nFeatures[!is.na(wayback$Feature_11)]<-11
wayback$nFeatures[!is.na(wayback$Feature_12)]<-12
wayback$nFeatures[!is.na(wayback$Feature_13)]<-13
wayback$nFeatures[!is.na(wayback$Feature_14)]<-14
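Equivalently, because features fill consecutively from Feature_1, the fourteen assignments above reduce to a single rowSums over the non-missing feature columns. A sketch, with count_features as a hypothetical helper:

```r
# Count the non-NA feature columns in each row; because features fill in
# order, this equals the index of the last filled feature
count_features <- function(df){
  rowSums(!is.na(df[paste0("Feature_", 1:14)]))
}
```

wayback$nFeatures <- count_features(wayback) would then replicate the fourteen assignments in one call.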

Here are some summary statistics for the nFeatures variable we just created:

## [1] "Mean: 7.0414201183432"
## [1] "Standard Deviation: 5.1192389566291"
## 
##  0  1  2  5  6  7  8 11 13 
## 17 38  3  3  9  3 38  1 57
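For reference, output in this shape comes from mean, sd, and table; a small helper sketch (describe_counts is hypothetical, not the original code):

```r
# Print the mean, standard deviation, and frequency table of a count vector
describe_counts <- function(x){
  print(paste("Mean:", mean(x)))
  print(paste("Standard Deviation:", sd(x)))
  print(table(x))
}
```

Calling describe_counts(wayback$nFeatures) would reproduce output of this form.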

As an added cleaning step, we decided to remove eleven cases from the data set. These are observations that occur at a later time, but on the same date, as another observation. Removing them simplifies how the data are handled for the vector autoregression to follow; in all cases where two observations fall on the same day, their content is identical.

wayback$Duplicated<-0
for(i in 2:dim(wayback)[1]){
  if(wayback$Date[i]==wayback$Date[i-1]){
    wayback$Duplicated[i]<-1
  }
}
wayback<-wayback[wayback$Duplicated==0,1:21]
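The flag-and-filter loop above is equivalent to a one-liner with base R’s duplicated, which also covers non-adjacent repeats. A sketch (dedup_by_date is a hypothetical helper):

```r
# Keep only the first snapshot per date; duplicated() flags any repeat of an
# earlier value, so the frame does not even need to be sorted
dedup_by_date <- function(df) df[!duplicated(df$Date), ]
```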

Finally, we saved this data frame as “Wayback_Complete.rds” – this file should be available for download from the repository where this markdown file is located, and can be read back into R with readRDS().

saveRDS(wayback, "Wayback_Complete.rds")

Vector autoregression

To relate the content of tweets pertaining to Path to the content of Path’s App Store feature lists, we used a modeling technique called “vector autoregression”, or VAR. This is a relatively straightforward kind of time-series analysis in which one measure can be shown to relate to another over time. For this portion of the analysis, we’ll rely on the following packages:

library(parallel)
library(vars)
library(zoo)
library(stm)

Once again, the parallel package lets us make use of the multiple processing cores on modern machines. The parallel code in this section has been verified to work on most Macintosh and Linux systems. If you are using a Windows machine, replace the parallel calls with their serial equivalents.
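One portable option, if you want a single code path across platforms, is a small wrapper (papply is a hypothetical helper, not part of the original scripts):

```r
library(parallel)

# papply: mclapply on Unix-alikes, plain lapply on Windows, where
# fork-based parallelism is unavailable
papply <- function(X, FUN, ...){
  if(.Platform$OS.type == "windows"){
    lapply(X, FUN, ...)
  } else {
    mclapply(X, FUN, ..., mc.cores = max(1, detectCores() - 1))
  }
}
```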

This section is broken down into two parts: cleaning/imputation and VAR results.
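Before diving in, here is a minimal sketch of what a VAR fit looks like with the vars package, on toy simulated data (purely illustrative; the real models come later):

```r
library(vars)
set.seed(1)

# Two toy series: y follows x with a one-step lag
x <- as.numeric(arima.sim(list(ar = 0.5), n = 200))
y <- 0.6 * c(0, head(x, -1)) + rnorm(200, sd = 0.5)

fit <- VAR(data.frame(x = x, y = y), p = 1, type = "const")
summary(fit)                          # lagged x should help predict y
causality(fit, cause = "x")$Granger   # Granger test: does x improve forecasts of y?
```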


Cleaning and imputation

We have two sets of data that we would like to relate to one another: Path’s Wayback Features and the Twitter STM results. We’ll start by reading them in and converting the timestamp to the appropriate Date format – the Date conversion is important as this is used to index the measures over time.

wayback<-readRDS("Wayback_Complete.rds")
wayback$Date<-as.Date(as.character(wayback$Date))
twitter<-readRDS("Path_Data_noHTMLorTAG.rds")
twitter$date<-as.Date(twitter$Date_EST)

On the Twitter side, a number of documents were removed during the cleaning steps of the STM procedure. The text-processor results were saved, and a variable in that result called docs.removed gives the indices of the Twitter documents that were removed in cleaning.

textPro<-readRDS("textPro.rds")
twitter<-twitter[-(textPro$docs.removed),]

Next, let’s read in the STM results – here, we’re specifically after the theta matrix, which holds the fitted topic proportions for each document.

stmresults<-readRDS("stm.rds")
stmresults<-stmresults$theta

Now that the STM results are loaded, we can determine which of the 75 modeled topics each observation fits best. The variable Topic_fit stores the maximum theta value for the document, and Topic stores the integer (1 through 75) of the topic that theta belongs to. There may be more efficient ways to compute this, but it runs quickly enough for our purposes.

twitter$Topic<-NA
twitter$Topic_fit<-NA
for(i in 1:dim(twitter)[1]){
  twitter$Topic_fit[i]<-max(stmresults[i,])
  for(j in 1:dim(stmresults)[2]){
    if(stmresults[i,j]==twitter$Topic_fit[i]){
      twitter$Topic[i]<-j
    }
  }
}
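The double loop above can also be written as one vectorized pass. A sketch, where best_topic is a hypothetical helper operating on the documents-by-topics theta matrix:

```r
# Row maxima give the best fit; max.col gives the column (topic) achieving it.
# ties.method = "last" mirrors the loop above, which keeps the last matching topic.
best_topic <- function(theta){
  data.frame(Topic_fit = apply(theta, 1, max),
             Topic = max.col(theta, ties.method = "last"))
}
```

twitter[c("Topic_fit", "Topic")] <- best_topic(stmresults) would then fill both variables at once.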

Now, for the VAR analyses, we need the data condensed to a uniform time interval (i.e., the gaps between observations must be equal). To do that, we created averages at three granularities: daily, weekly, and monthly. We’ll start with daily.

Because the Wayback data are already ordered by date, earliest first, we can construct our days data frame by iterating from the first entry (November 15, 2010 – the day Path became available on the App Store) to the end of our data collection period (December 31, 2017).

days<-as.data.frame(cbind(wayback$Date[1]:as.Date("2017-12-31")))
names(days)[1]<-"date"
days$date<-as.Date(days$date, origin="1970-1-1")

Then, to record how many Twitter observations fall on each day, let’s create a new variable in the data frame:

days$twit_nObs<-0
for(i in 1:dim(days)[1]){
  days$twit_nObs[i]<-length(twitter$date[twitter$date==days$date[i]])
}

Next, we need a function that will return the proportion of tweets on a given day that were categorized into one of the eight topic categories which appear in both Path’s App Store Features and the STM results. Those topics are:

  1. Aesthetics
  2. Cost
  3. Expression
  4. Integration
  5. Lifelogging
  6. Network
  7. Privacy
  8. Simplicity

When there are no observations for a day, we’ll return NA for now. Finally, to make the function work across all topic categories, we’ll add a type parameter which tells the function which topic numbers to check for (shown in the comments). Here is the resulting function:

#aesthetics <- 39, 48
#cost <- 30
#expression <- 2
#integration <- 3, 63, 73
#lifelogging <- 18
#network <- 13, 44, 56
#privacy <- 42, 72
#simplicity <- 46

twitFillFn<-function(i, type){
  if(!(days$date[i] %in% twitter$date)){
    return(NA)
  }
  if((days$date[i] %in% twitter$date) & type==1){
    return(length(twitter$Topic[(twitter$Topic==39 | twitter$Topic==48) & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
  if((days$date[i] %in% twitter$date) & type==2){
    return(length(twitter$Topic[twitter$Topic==30 & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
  if((days$date[i] %in% twitter$date) & type==3){
    return(length(twitter$Topic[twitter$Topic==2 & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
  if((days$date[i] %in% twitter$date) & type==4){
    return(length(twitter$Topic[(twitter$Topic==3 | twitter$Topic==63 | twitter$Topic==73) & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
  if((days$date[i] %in% twitter$date) & type==5){
    return(length(twitter$Topic[twitter$Topic==18 & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
  if((days$date[i] %in% twitter$date) & type==6){
    return(length(twitter$Topic[(twitter$Topic==13 | twitter$Topic==44 | twitter$Topic==56) & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
  if((days$date[i] %in% twitter$date) & type==7){
    return(length(twitter$Topic[(twitter$Topic==42 | twitter$Topic==72) & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
  if((days$date[i] %in% twitter$date) & type==8){
    return(length(twitter$Topic[(twitter$Topic==46) & twitter$date==days$date[i]])/days$twit_nObs[i])
  }
}
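The eight-branch chain can also be written once with a topic lookup. An equivalent sketch, where topic_map and twitFillFn2 are hypothetical names and the topic numbers are those listed in the comments above:

```r
# Topic numbers per category, in type order 1..8:
# aesthetics, cost, expression, integration, lifelogging, network, privacy, simplicity
topic_map <- list(c(39, 48), 30, 2, c(3, 63, 73), 18, c(13, 44, 56), c(42, 72), 46)

twitFillFn2 <- function(i, type){
  if(!(days$date[i] %in% twitter$date)) return(NA)
  # Proportion of that day's tweets falling in the category's topics
  sum(twitter$Topic %in% topic_map[[type]] &
      twitter$date == days$date[i]) / days$twit_nObs[i]
}
```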

The data can be added to the frame in parallel using the mclapply function from the parallel package. If you would like to run this code on a Windows machine, change mclapply to lapply and remove the mc.cores parameter from the function call.

days$twit_aesthetics<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=1, mc.cores=detectCores()))
days$twit_cost<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=2, mc.cores=detectCores()))
days$twit_expression<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=3, mc.cores=detectCores()))
days$twit_integration<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=4, mc.cores=detectCores()))
days$twit_lifelogging<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=5, mc.cores=detectCores()))
days$twit_network<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=6, mc.cores=detectCores()))
days$twit_privacy<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=7, mc.cores=detectCores()))
days$twit_simplicity<-unlist(mclapply(1:dim(days)[1], twitFillFn, type=8, mc.cores=detectCores()))

Next, we’ll need the Path coded results at a daily level. Although a count of the Wayback observations at the daily level is not necessary for calculating the proportions, it may come in handy when attempting to diagnose any problems with the imputation process.

days$path_nObs<-0
for(i in 1:dim(days)[1]){
  days$path_nObs[i]<-length(wayback$Date[wayback$Date==days$date[i]])
}

Before pulling these data into the daily frame, we’ll first replace all NA observations with 0. The comments below map each topic category to its column number.

#aesthetics <- 2
#cost <- 9
#expression <- 6
#integration <- 11
#lifelogging <- 5
#network <- 1
#privacy <- 7
#simplicity <- 8

wayback$pFeat_t2[is.na(wayback$pFeat_t2)]<-0
wayback$pFeat_t9[is.na(wayback$pFeat_t9)]<-0
wayback$pFeat_t6[is.na(wayback$pFeat_t6)]<-0
wayback$pFeat_t11[is.na(wayback$pFeat_t11)]<-0
wayback$pFeat_t5[is.na(wayback$pFeat_t5)]<-0
wayback$pFeat_t1[is.na(wayback$pFeat_t1)]<-0
wayback$pFeat_t7[is.na(wayback$pFeat_t7)]<-0
wayback$pFeat_t8[is.na(wayback$pFeat_t8)]<-0

Next, we’ll create a similar filling function for the Path results:

pathFillFn<-function(i, type){
  if(!(days$date[i] %in% wayback$Date)){
    return(NA)
  }
  if((days$date[i] %in% wayback$Date) & type==1){
    return(wayback$pFeat_t2[wayback$Date==days$date[i]])
  }
  if((days$date[i] %in% wayback$Date) & type==2){
    return(wayback$pFeat_t9[wayback$Date==days$date[i]])
  }
  if((days$date[i] %in% wayback$Date) & type==3){
    return(wayback$pFeat_t6[wayback$Date==days$date[i]])
  }
  if((days$date[i] %in% wayback$Date) & type==4){
    return(wayback$pFeat_t11[wayback$Date==days$date[i]])
  }
  if((days$date[i] %in% wayback$Date) & type==5){
    return(wayback$pFeat_t5[wayback$Date==days$date[i]])
  }
  if((days$date[i] %in% wayback$Date) & type==6){
    return(wayback$pFeat_t1[wayback$Date==days$date[i]])
  }
  if((days$date[i] %in% wayback$Date) & type==7){
    return(wayback$pFeat_t7[wayback$Date==days$date[i]])
  }
  if((days$date[i] %in% wayback$Date) & type==8){
    return(wayback$pFeat_t8[wayback$Date==days$date[i]])
  }
}

… and we can fill the data frame using the following parallel code:

days$path_aesthetics<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=1, mc.cores=detectCores()))
days$path_cost<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=2, mc.cores=detectCores()))
days$path_expression<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=3, mc.cores=detectCores()))
days$path_integration<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=4, mc.cores=detectCores()))
days$path_lifelogging<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=5, mc.cores=detectCores()))
days$path_network<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=6, mc.cores=detectCores()))
days$path_privacy<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=7, mc.cores=detectCores()))
days$path_simplicity<-unlist(mclapply(1:dim(days)[1], pathFillFn, type=8, mc.cores=detectCores()))

Let’s save our progress on the days data frame:

saveRDS(days, "Days_w_NA.rds")

The last thing to do with the daily data is to impute values for the missing observations. VAR cannot handle gaps in the time series data, so we need to fill in those gaps. We’ll use two techniques: for the Twitter data, we’ll use a linear interpolation method (described in the paper). For the Path Wayback data, we’ll use a feed-forward approach which simply takes the most recent observation and imputes that value forward in time until another observation is found. These use the na.approx and na.locf functions (respectively) from the zoo package.

days$twit_aesthetics<-na.approx(days$twit_aesthetics)
days$twit_cost<-na.approx(days$twit_cost)
days$twit_expression<-na.approx(days$twit_expression)
days$twit_integration<-na.approx(days$twit_integration)
days$twit_lifelogging<-na.approx(days$twit_lifelogging)
days$twit_network<-na.approx(days$twit_network)
days$twit_privacy<-na.approx(days$twit_privacy)
days$twit_simplicity<-na.approx(days$twit_simplicity)

days$path_aesthetics<-na.locf(days$path_aesthetics)
days$path_cost<-na.locf(days$path_cost)
days$path_expression<-na.locf(days$path_expression)
days$path_integration<-na.locf(days$path_integration)
days$path_lifelogging<-na.locf(days$path_lifelogging)
days$path_network<-na.locf(days$path_network)
days$path_privacy<-na.locf(days$path_privacy)
days$path_simplicity<-na.locf(days$path_simplicity)
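The difference between the two imputation methods is easy to see on a toy vector:

```r
library(zoo)

x <- c(1, NA, NA, 4)
na.approx(x)  # linear interpolation: 1 2 3 4
na.locf(x)    # last observation carried forward: 1 1 1 4
```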

And let’s save these results as well.

saveRDS(days, "Days_wo_NA.rds")

Moving to the weekly level now, we’ll start again by creating a data frame filled with dates. We can sequence from the start to the finish, counting every 7th day to get the start of the week (Mondays). Then we’ll add 6 to each entry to get the end of the week (Sundays). Using the Monday-to-Sunday week, we are left with no partial weeks at either end of the data collection period.

weeks<-as.data.frame(cbind(seq(wayback$Date[1], as.Date("2017-12-31"), by=7)))
names(weeks)[1]<-"week_start"
weeks$week_start<-as.Date(weeks$week_start, origin="1970-1-1")
weeks$week_end<-weeks$week_start+6

Then, we’ll need to count the number of observations in the Twitter data set which occur within each week:

weeks$twit_nObs<-0
for(i in 1:dim(weeks)[1]){
  weeks$twit_nObs[i]<-length(twitter$date[twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])
}

After that, we can write a similar data retrieval function which returns the proportion of Tweets categorized in each topic – the difference between this and the daily one is that this one looks within a range of dates rather than matching exactly one day. We then filled the data frame like before.

twitFillFn<-function(i, type){
  if(weeks$twit_nObs[i]==0){
    return(NA)
  }
  if(weeks$twit_nObs[i]!=0 & type==1){
    return(length(twitter$Topic[(twitter$Topic==39 | twitter$Topic==48) & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
  if(weeks$twit_nObs[i]!=0 & type==2){
    return(length(twitter$Topic[twitter$Topic==30 & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
  if(weeks$twit_nObs[i]!=0 & type==3){
    return(length(twitter$Topic[twitter$Topic==2 & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
  if(weeks$twit_nObs[i]!=0 & type==4){
    return(length(twitter$Topic[(twitter$Topic==3 | twitter$Topic==63 | twitter$Topic==73) & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
  if(weeks$twit_nObs[i]!=0 & type==5){
    return(length(twitter$Topic[twitter$Topic==18 & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
  if(weeks$twit_nObs[i]!=0 & type==6){
    return(length(twitter$Topic[(twitter$Topic==13 | twitter$Topic==44 | twitter$Topic==56) & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
  if(weeks$twit_nObs[i]!=0 & type==7){
    return(length(twitter$Topic[(twitter$Topic==42 | twitter$Topic==72) & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
  if(weeks$twit_nObs[i]!=0 & type==8){
    return(length(twitter$Topic[(twitter$Topic==46) & twitter$date>=weeks$week_start[i] & twitter$date<=weeks$week_end[i]])/weeks$twit_nObs[i])
  }
}

weeks$twit_aesthetics<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=1, mc.cores=detectCores()))
weeks$twit_cost<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=2, mc.cores=detectCores()))
weeks$twit_expression<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=3, mc.cores=detectCores()))
weeks$twit_integration<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=4, mc.cores=detectCores()))
weeks$twit_lifelogging<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=5, mc.cores=detectCores()))
weeks$twit_network<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=6, mc.cores=detectCores()))
weeks$twit_privacy<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=7, mc.cores=detectCores()))
weeks$twit_simplicity<-unlist(mclapply(1:dim(weeks)[1], twitFillFn, type=8, mc.cores=detectCores()))

Turning to the Path Wayback data, we’ll use a very similar set of expressions to fill the weekly frame:

weeks$path_nObs<-0
for(i in 1:dim(weeks)[1]){
  weeks$path_nObs[i]<-length(wayback$Date[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]])
}

pathFillFn<-function(i, type){
  if(weeks$path_nObs[i]==0){
    return(NA)
  }
  if(weeks$path_nObs[i]!=0 & type==1){
    return(mean(wayback$pFeat_t2[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
  if(weeks$path_nObs[i]!=0 & type==2){
    return(mean(wayback$pFeat_t9[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
  if(weeks$path_nObs[i]!=0 & type==3){
    return(mean(wayback$pFeat_t6[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
  if(weeks$path_nObs[i]!=0 & type==4){
    return(mean(wayback$pFeat_t11[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
  if(weeks$path_nObs[i]!=0 & type==5){
    return(mean(wayback$pFeat_t5[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
  if(weeks$path_nObs[i]!=0 & type==6){
    return(mean(wayback$pFeat_t1[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
  if(weeks$path_nObs[i]!=0 & type==7){
    return(mean(wayback$pFeat_t7[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
  if(weeks$path_nObs[i]!=0 & type==8){
    return(mean(wayback$pFeat_t8[wayback$Date>=weeks$week_start[i] & wayback$Date<=weeks$week_end[i]]))
  }
}

weeks$path_aesthetics<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=1, mc.cores=detectCores()))
weeks$path_cost<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=2, mc.cores=detectCores()))
weeks$path_expression<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=3, mc.cores=detectCores()))
weeks$path_integration<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=4, mc.cores=detectCores()))
weeks$path_lifelogging<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=5, mc.cores=detectCores()))
weeks$path_network<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=6, mc.cores=detectCores()))
weeks$path_privacy<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=7, mc.cores=detectCores()))
weeks$path_simplicity<-unlist(mclapply(1:dim(weeks)[1], pathFillFn, type=8, mc.cores=detectCores()))
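
As an aside, the eight near-identical branches of pathFillFn can be collapsed with a lookup vector mapping each type code to its Wayback feature column. A sketch under the same column names as above (pathFillFn2 is a hypothetical name, not used elsewhere in this document):

```r
# Map type codes 1..8 to the corresponding Wayback feature columns:
featCols <- c("pFeat_t2", "pFeat_t9", "pFeat_t6", "pFeat_t11",
              "pFeat_t5", "pFeat_t1", "pFeat_t7", "pFeat_t8")
pathFillFn2 <- function(i, type){
  if(weeks$path_nObs[i] == 0) return(NA)
  inWeek <- wayback$Date >= weeks$week_start[i] & wayback$Date <= weeks$week_end[i]
  mean(wayback[[featCols[type]]][inWeek])
}
```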

We can save the results here:

saveRDS(weeks, "Weeks_w_NA.rds")

Lastly, we’ll need to use the same imputation methods as before to fill in the weekly gaps. There should be fewer missing observations in each dataset as we expand the time frame.
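
As a quick reminder of how the two methods behave, here is a toy vector (illustrative values only): na.approx linearly interpolates between observed points, while na.locf carries the last observed value forward.

```r
library(zoo)
x <- c(1, NA, 3, NA, NA, 6)
na.approx(x)  # 1 2 3 4 5 6 (linear interpolation)
na.locf(x)    # 1 1 3 3 3 6 (last observation carried forward)
```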

weeks$twit_aesthetics<-na.approx(weeks$twit_aesthetics)
weeks$twit_cost<-na.approx(weeks$twit_cost)
weeks$twit_expression<-na.approx(weeks$twit_expression)
weeks$twit_integration<-na.approx(weeks$twit_integration)
weeks$twit_lifelogging<-na.approx(weeks$twit_lifelogging)
weeks$twit_network<-na.approx(weeks$twit_network)
weeks$twit_privacy<-na.approx(weeks$twit_privacy)
weeks$twit_simplicity<-na.approx(weeks$twit_simplicity)

weeks$path_aesthetics<-na.locf(weeks$path_aesthetics)
weeks$path_cost<-na.locf(weeks$path_cost)
weeks$path_expression<-na.locf(weeks$path_expression)
weeks$path_integration<-na.locf(weeks$path_integration)
weeks$path_lifelogging<-na.locf(weeks$path_lifelogging)
weeks$path_network<-na.locf(weeks$path_network)
weeks$path_privacy<-na.locf(weeks$path_privacy)
weeks$path_simplicity<-na.locf(weeks$path_simplicity)

And we’ll save these results as well:

saveRDS(weeks, "Weeks_wo_NA.rds")

Finally, the construction of the monthly data frame closely follows the weekly method. The difference here is that the first period is expanded from November 15th back to November 1st. Because months contain different numbers of days, we can’t simply add an integer to retrieve the start and end of each month. Instead, we’ll use the as.yearmon function from the zoo package to determine the start of each month, then subtract 1 day from the following month’s start to determine the end. This code was created with the help of the zoo user manual; credit to its authors, Achim Zeileis and Gabor Grothendieck.

months<-as.Date("2010-11-1")
next.month<-function(d){as.Date(as.yearmon(d)+1/12)+as.numeric(d-as.Date(as.yearmon(d)))}
while(months[length(months)]<as.Date("2017-12-1")){
  months<-c(months, next.month(months[length(months)]))
}
months<-as.data.frame(cbind(months))
names(months)[1]<-"month_start"
months$month_start<-as.Date(months$month_start, origin="1970-1-1")
months$month_end<-as.Date(unlist(lapply(months$month_start, next.month))-1, origin="1970-1-1")
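
For reference, base R’s seq.Date can produce the same monthly skeleton without zoo; a sketch assuming the same date range as the loop above:

```r
# seq(by = "month") lands on the first of each month; subtracting 1 day from
# the next month's start yields each month's last day.
month_start <- seq(as.Date("2010-11-01"), as.Date("2017-12-01"), by = "month")
month_end   <- seq(as.Date("2010-12-01"), as.Date("2018-01-01"), by = "month") - 1
```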

We can then fill the data frame using similar functions to the weekly data:

months$twit_nObs<-0
for(i in 1:dim(months)[1]){
  months$twit_nObs[i]<-length(twitter$date[twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])
}

twitFillFn<-function(i, type){
  if(months$twit_nObs[i]==0){
    return(NA)
  }
  if(months$twit_nObs[i]!=0 & type==1){
    return(length(twitter$Topic[(twitter$Topic==39 | twitter$Topic==48) & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
  if(months$twit_nObs[i]!=0 & type==2){
    return(length(twitter$Topic[twitter$Topic==30 & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
  if(months$twit_nObs[i]!=0 & type==3){
    return(length(twitter$Topic[twitter$Topic==2 & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
  if(months$twit_nObs[i]!=0 & type==4){
    return(length(twitter$Topic[(twitter$Topic==3 | twitter$Topic==63 | twitter$Topic==73) & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
  if(months$twit_nObs[i]!=0 & type==5){
    return(length(twitter$Topic[twitter$Topic==18 & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
  if(months$twit_nObs[i]!=0 & type==6){
    return(length(twitter$Topic[(twitter$Topic==13 | twitter$Topic==44 | twitter$Topic==56) & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
  if(months$twit_nObs[i]!=0 & type==7){
    return(length(twitter$Topic[(twitter$Topic==42 | twitter$Topic==72) & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
  if(months$twit_nObs[i]!=0 & type==8){
    return(length(twitter$Topic[(twitter$Topic==46) & twitter$date>=months$month_start[i] & twitter$date<=months$month_end[i]])/months$twit_nObs[i])
  }
}

months$twit_aesthetics<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=1, mc.cores=detectCores()))
months$twit_cost<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=2, mc.cores=detectCores()))
months$twit_expression<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=3, mc.cores=detectCores()))
months$twit_integration<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=4, mc.cores=detectCores()))
months$twit_lifelogging<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=5, mc.cores=detectCores()))
months$twit_network<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=6, mc.cores=detectCores()))
months$twit_privacy<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=7, mc.cores=detectCores()))
months$twit_simplicity<-unlist(mclapply(1:dim(months)[1], twitFillFn, type=8, mc.cores=detectCores()))

months$path_nObs<-0
for(i in 1:dim(months)[1]){
  months$path_nObs[i]<-length(wayback$Date[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]])
}

pathFillFn<-function(i, type){
  if(months$path_nObs[i]==0){
    return(NA)
  }
  if(months$path_nObs[i]!=0 & type==1){
    return(mean(wayback$pFeat_t2[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
  if(months$path_nObs[i]!=0 & type==2){
    return(mean(wayback$pFeat_t9[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
  if(months$path_nObs[i]!=0 & type==3){
    return(mean(wayback$pFeat_t6[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
  if(months$path_nObs[i]!=0 & type==4){
    return(mean(wayback$pFeat_t11[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
  if(months$path_nObs[i]!=0 & type==5){
    return(mean(wayback$pFeat_t5[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
  if(months$path_nObs[i]!=0 & type==6){
    return(mean(wayback$pFeat_t1[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
  if(months$path_nObs[i]!=0 & type==7){
    return(mean(wayback$pFeat_t7[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
  if(months$path_nObs[i]!=0 & type==8){
    return(mean(wayback$pFeat_t8[wayback$Date>=months$month_start[i] & wayback$Date<=months$month_end[i]]))
  }
}

months$path_aesthetics<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=1, mc.cores=detectCores()))
months$path_cost<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=2, mc.cores=detectCores()))
months$path_expression<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=3, mc.cores=detectCores()))
months$path_integration<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=4, mc.cores=detectCores()))
months$path_lifelogging<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=5, mc.cores=detectCores()))
months$path_network<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=6, mc.cores=detectCores()))
months$path_privacy<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=7, mc.cores=detectCores()))
months$path_simplicity<-unlist(mclapply(1:dim(months)[1], pathFillFn, type=8, mc.cores=detectCores()))

… and save those results:

saveRDS(months, "Months_w_NA.rds")

Then finally let’s impute missing values:

months$twit_aesthetics<-na.approx(months$twit_aesthetics)
months$twit_cost<-na.approx(months$twit_cost)
months$twit_expression<-na.approx(months$twit_expression)
months$twit_integration<-na.approx(months$twit_integration)
months$twit_lifelogging<-na.approx(months$twit_lifelogging)
months$twit_network<-na.approx(months$twit_network)
months$twit_privacy<-na.approx(months$twit_privacy)
months$twit_simplicity<-na.approx(months$twit_simplicity)

months$path_aesthetics<-na.locf(months$path_aesthetics)
months$path_cost<-na.locf(months$path_cost)
months$path_expression<-na.locf(months$path_expression)
months$path_integration<-na.locf(months$path_integration)
months$path_lifelogging<-na.locf(months$path_lifelogging)
months$path_network<-na.locf(months$path_network)
months$path_privacy<-na.locf(months$path_privacy)
months$path_simplicity<-na.locf(months$path_simplicity)

… and save those results as well:

saveRDS(months, "Months_wo_NA.rds")

Now that the data is prepared at the various time intervals, let’s begin the analysis.


VAR results

Turning to the vector autoregression models, we’ll start by loading the vars library and the data frames we created earlier. This means that once you have run the cleaning steps, you can return to this part of the code without re-running them.

library(vars)
days<-readRDS("Days_wo_NA.rds")
weeks<-readRDS("Weeks_wo_NA.rds")
months<-readRDS("Months_wo_NA.rds")

Next, the VAR function that estimates the models requires the data to be of class ts (time series). This is a straightforward conversion using the ts function in base R:

d_aest<-ts(cbind(days$twit_aesthetics, days$path_aesthetics))
d_cost<-ts(cbind(days$twit_cost, days$path_cost))
d_expr<-ts(cbind(days$twit_expression, days$path_expression))
d_inte<-ts(cbind(days$twit_integration, days$path_integration))
d_life<-ts(cbind(days$twit_lifelogging, days$path_lifelogging))
d_netw<-ts(cbind(days$twit_network, days$path_network))
d_priv<-ts(cbind(days$twit_privacy, days$path_privacy))
d_simp<-ts(cbind(days$twit_simplicity, days$path_simplicity))

w_aest<-ts(cbind(weeks$twit_aesthetics, weeks$path_aesthetics))
w_cost<-ts(cbind(weeks$twit_cost, weeks$path_cost))
w_expr<-ts(cbind(weeks$twit_expression, weeks$path_expression))
w_inte<-ts(cbind(weeks$twit_integration, weeks$path_integration))
w_life<-ts(cbind(weeks$twit_lifelogging, weeks$path_lifelogging))
w_netw<-ts(cbind(weeks$twit_network, weeks$path_network))
w_priv<-ts(cbind(weeks$twit_privacy, weeks$path_privacy))
w_simp<-ts(cbind(weeks$twit_simplicity, weeks$path_simplicity))

m_aest<-ts(cbind(months$twit_aesthetics, months$path_aesthetics))
m_cost<-ts(cbind(months$twit_cost, months$path_cost))
m_expr<-ts(cbind(months$twit_expression, months$path_expression))
m_inte<-ts(cbind(months$twit_integration, months$path_integration))
m_life<-ts(cbind(months$twit_lifelogging, months$path_lifelogging))
m_netw<-ts(cbind(months$twit_network, months$path_network))
m_priv<-ts(cbind(months$twit_privacy, months$path_privacy))
m_simp<-ts(cbind(months$twit_simplicity, months$path_simplicity))

Finally, all that’s left is to estimate the models. The VAR function takes two arguments here: the topic time series created above, and a lag.max parameter that sets how far back in time (maximally) the model may look for relationships between variables. The model automatically selects the best-fitting lag between 1 and this maximum. For our models, we elected a 3-month maximum; no model selected the maximum as the best fit, so we feel this struck a good balance between exhaustive searching and computation time. The models are estimated below.
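
If you’d like to inspect the lag-selection criteria directly rather than rely on VAR’s internal choice, the vars package’s VARselect function reports the lag preferred by each information criterion. A sketch, assuming the d_simp series defined above:

```r
# Compare AIC, HQ, SC, and FPE across candidate lags (here the daily maximum):
sel <- VARselect(d_simp, lag.max = 90)
sel$selection  # the lag each criterion prefers
```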

Daily:

dvar_aest<-VAR(d_aest, lag.max=90)
dvar_cost<-VAR(d_cost, lag.max=90)
dvar_expr<-VAR(d_expr, lag.max=90)
dvar_inte<-VAR(d_inte, lag.max=90)
dvar_life<-VAR(d_life, lag.max=90)
dvar_netw<-VAR(d_netw, lag.max=90)
dvar_priv<-VAR(d_priv, lag.max=90)
dvar_simp<-VAR(d_simp, lag.max=90)

Weekly:

wvar_aest<-VAR(w_aest, lag.max=12)
wvar_cost<-VAR(w_cost, lag.max=12)
wvar_expr<-VAR(w_expr, lag.max=12)
wvar_inte<-VAR(w_inte, lag.max=12)
wvar_life<-VAR(w_life, lag.max=12)
wvar_netw<-VAR(w_netw, lag.max=12)
wvar_priv<-VAR(w_priv, lag.max=12)
wvar_simp<-VAR(w_simp, lag.max=12)

Monthly:

mvar_aest<-VAR(m_aest, lag.max=3)
mvar_cost<-VAR(m_cost, lag.max=3)
mvar_expr<-VAR(m_expr, lag.max=3)
mvar_inte<-VAR(m_inte, lag.max=3)
mvar_life<-VAR(m_life, lag.max=3)
mvar_netw<-VAR(m_netw, lag.max=3)
mvar_priv<-VAR(m_priv, lag.max=3)
mvar_simp<-VAR(m_simp, lag.max=3)

For the sake of conciseness, we reported only the significant results for the “simplicity” topic in the paper. Additional results can be printed by changing the model passed to the summary function. In all of the results, “Series 1” refers to Twitter and “Series 2” refers to Path’s App Store features.

summary(dvar_simp)$varresult
## $Series.1
## 
## Call:
## lm(formula = y ~ -1 + ., data = datamat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04183 -0.00400 -0.00400 -0.00233  0.33100 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## Series.1.l1  0.1134778  0.0196406   5.778 8.48e-09 ***
## Series.2.l1 -0.0099460  0.0761659  -0.131    0.896    
## Series.1.l2  0.0320691  0.0197786   1.621    0.105    
## Series.2.l2 -0.0034142  0.0761027  -0.045    0.964    
## const        0.0039992  0.0005429   7.366 2.35e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02157 on 2597 degrees of freedom
## Multiple R-squared:  0.0178, Adjusted R-squared:  0.01628 
## F-statistic: 11.76 on 4 and 2597 DF,  p-value: 1.805e-09
## 
## 
## $Series.2
## 
## Call:
## lm(formula = y ~ -1 + ., data = datamat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.124494 -0.000407 -0.000126  0.000049  0.208665 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## Series.1.l1 -0.0324961  0.0050519  -6.432 1.49e-10 ***
## Series.2.l1  0.9796967  0.0195910  50.007  < 2e-16 ***
## Series.1.l2 -0.0201325  0.0050874  -3.957 7.78e-05 ***
## Series.2.l2  0.0166523  0.0195748   0.851  0.39501    
## const        0.0004072  0.0001397   2.916  0.00358 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005549 on 2597 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9947 
## F-statistic: 1.215e+05 on 4 and 2597 DF,  p-value: < 2.2e-16
summary(wvar_simp)$varresult
## $Series.1
## 
## Call:
## lm(formula = y ~ -1 + ., data = datamat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.012514 -0.004792 -0.003160 -0.000005  0.139697 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Series.1.l1  0.054060   0.052438   1.031    0.303    
## Series.2.l1 -0.013049   0.007890  -1.654    0.099 .  
## const        0.004792   0.000796   6.020 4.22e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01141 on 368 degrees of freedom
## Multiple R-squared:  0.01189,    Adjusted R-squared:  0.006517 
## F-statistic: 2.213 on 2 and 368 DF,  p-value: 0.1108
## 
## 
## $Series.2
## 
## Call:
## lm(formula = y ~ -1 + ., data = datamat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.123747 -0.001201 -0.000950  0.001253  0.209100 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Series.1.l1 -0.020060   0.068303  -0.294    0.769    
## Series.2.l1  0.980371   0.010278  95.387   <2e-16 ***
## const        0.001201   0.001037   1.158    0.248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01486 on 368 degrees of freedom
## Multiple R-squared:  0.962,  Adjusted R-squared:  0.9618 
## F-statistic:  4664 on 2 and 368 DF,  p-value: < 2.2e-16
summary(mvar_simp)$varresult
## $Series.1
## 
## Call:
## lm(formula = y ~ -1 + ., data = datamat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.012514 -0.004792 -0.003160 -0.000005  0.139697 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Series.1.l1  0.054060   0.052438   1.031    0.303    
## Series.2.l1 -0.013049   0.007890  -1.654    0.099 .  
## const        0.004792   0.000796   6.020 4.22e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01141 on 368 degrees of freedom
## Multiple R-squared:  0.01189,    Adjusted R-squared:  0.006517 
## F-statistic: 2.213 on 2 and 368 DF,  p-value: 0.1108
## 
## 
## $Series.2
## 
## Call:
## lm(formula = y ~ -1 + ., data = datamat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.123747 -0.001201 -0.000950  0.001253  0.209100 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## Series.1.l1 -0.020060   0.068303  -0.294    0.769    
## Series.2.l1  0.980371   0.010278  95.387   <2e-16 ***
## const        0.001201   0.001037   1.158    0.248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01486 on 368 degrees of freedom
## Multiple R-squared:  0.962,  Adjusted R-squared:  0.9618 
## F-statistic:  4664 on 2 and 368 DF,  p-value: < 2.2e-16

Citations

  • Lee, M., & Mimno, D. (2014). Low-dimensional embeddings for interpretable anchor-based topic inference. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 1319–1328. https://doi.org/10.3115/v1/d14-1138

  • Salton, G. (1971). The SMART retrieval system—Experiments in automatic document processing. Upper Saddle River, NJ: Prentice-Hall.

  • Sweitzer, M. D., & Shulman, H. C. (2018). The effects of metacognition in survey research: Experimental, cross-sectional, and content-analytic evidence. Public Opinion Quarterly, 82(4), 745–768. https://doi.org/10.1093/poq/nfy034


Packages used in this document

(up-to-date as of June 15, 2020):

install.packages(c("SnowballC", "rjson", "igraph", "rsvd", "tm", "wordcloud", "irr", "geometry", "colorspace", "Rtsne", "vars", "showtext", "stm", "openxlsx", "ngram", "rvest", "stringr", "readxl", "knitr", "zoo"), repos="https://cloud.r-project.org", dependencies=TRUE)